Preface


The approach of Minsky and Papert in their book Perceptrons [36] provided the motivation for this
research. Their analysis of the perceptron introduced useful mathematical tools for understanding
performance limitations of 'neural-based' systems. In addition, it charted and quantified these limitations
and identified important areas for future investigation. As a result, the book Perceptrons identified issues
of learning and performance that have continued to be of concern to connectionist researchers even now
that the challenge for multi-level learning algorithms has, to some extent, been answered. The author
believes that the mathematical tools developed by Papert and Minsky will themselves be useful for better
understanding of connectionist architectures. In the author's view, the only shortcoming of the work
done by Minsky and Papert (and perhaps Rosenblatt as well) was their perspective. They treated the
perceptron from a 'computer' point of view. It was expected, for example, to determine whether or not a
'retinal object' was 'connected' even when the off-on state of a single 'pixel' could determine the
correct answer.

Most certainly, natural perception systems don't work in this fashion. Indeed, they must determine
the connectivity of objects despite inconsistencies or noise in the input stimuli. This eliminates the
possibility of 'computations' whose result is affected by a single stimulus element. The proper
perspective for these systems, in the author's view, is a probabilistic one in which the system's proper
response is characterizable in some way but is robust to uncertain, degraded, incomplete, and even
inconsistent information. The classifier identified in this work typifies just such a system and the foregoing
analysis should exemplify the proper viewpoint and methods for future investigations of systems of this
nature. In this light, this work will have been of merit if it has identified issues valuable to future efforts
and provides methods for analysis of perceptual/cognitive systems.

Chapter 1
Introduction

The systems under consideration are an outgrowth of work done on self-organizing automata and
perceptrons [35, 38] and later work in parallel associative memories, e.g. [21, 40]. Minsky and Papert
in [35] had carried out rather extensive mathematical analysis on perceptrons, revealing inherent
limitations in the classes of problems they could solve. These systems were 'learning' automata expected
to classify input 'stimuli' based on their past experience on 'training' inputs. Minsky and Papert showed
that multiple stages of perceptrons were required for many problems of interest, yet no training algorithm
guaranteed to converge to a solution was known at the time for multi-level systems. They concluded in
their book that the systems held little promise, and subsequent investigation of perceptrons evaporated.

Eventually however, with more powerful computers to carry out simulations, and the development
of several multi-level learning algorithms [0, 22, 38, 40, ch. 6-8], descendant offshoots of the perceptron
have regained interest. Currently a variety of these automata exist and are known by names such as
'Neural-nets', 'Parallel Distributed Processors' (PDP networks), and 'Associative Memories'. They are
collectively called 'connectionist architectures' and have been studied as self-organizing memories of
perception [28], content-addressable memories, hierarchical knowledge bases, and classification
systems [5, 8], models of human 'neural-computation' [8, 18], of human task performance and attentional
learning [41, 44], and of speech performance and natural language understanding [13, 40, ch. 18, 42].

These and other efforts have led to guarded optimism for the future of connectionist architectures as
knowledge engines or as models of human intelligence. Capabilities and limitations of both task learning
and performance have been demonstrated.* However, though many mathematical investigations (e.g.
Barto [0], Golden [15, Ml], Grossberg [10, 18], Kohonen [28]) have been conducted, including
information-capacity studies (see Abu-Mostafa [1, 2], Amit [3, 4], Keeler [27], Little et al. [32],
McEliece et al. [34]), there is much room for development of analytical understanding of the
capabilities of these systems.


*Good introductory articles to the subject include the books [21, 40]. For an introduction to the
mathematics of 'connectionist' or 'neural-based' systems, see [7, 40, ch. 9].


Development of connectionist memory systems in several forms has changed the concept of memory
from storage memory to what the author calls computational memory. Digital and other local
memories are examples of storage memory and have been supplemented by the distributed overlaid
memory systems. The latter have more complex characteristics. Interference between items stored results
in the capability of these systems to implicitly represent the regularities and relationships among the items.
Consequently, computation and storage in the system are no longer distinct processes but integral aspects
of the same phenomenon. These systems are 'information engines' or 'computational memory' rather
than 'information receptacles'.


A formulation is needed of memory as a general mode of storage and computation. An
information-theoretic approach appears most natural and promises to identify the essential features of
memory operation. The purpose of this thesis is threefold:

1. Analytical Models: A germinal characterization of memory theory will be presented. The
capabilities and limitations of any memory should then be expressible in terms of information
flow. Resultant information-theoretic relations will provide the desired means of analysis and
a framework for understanding any particular memory system as a member of the general
class of computational systems.

2. Relevant Issues: Theory in 1 is used to identify major issues to be addressed for the
understanding of storage memory. These issues include identification of 'memory tasks',
amount of information provided by the memory for the task, amount of information required
by the task for a given amount of storage, the maximum number of items storable in the
system with respect to the specified task, definition of memory load, memory load vs.
performance, and identification of particular tasks useful to computation.

3. Evaluation of quantitative performance: Performance of the associator with respect to
issues identified in objective 2 is quantified utilizing the theory from objective 1. First,
storage-capacity is evaluated so that the notion of 'memory-load' can be developed.
Classification capabilities are then evaluated as the memory-load is increased. Architectural
considerations and hardware tradeoffs are addressed, as well as performance degradation due
to the introduction of non-linearities at the system output. Finally, figures of merit are used
to compare system performance with the optimal.


It is intended that this work will provide the proper context and starting point for further
investigation of memory as a computational structure.


1.1. 'Neural-based' systems

Matrix models of parallel distributed memories were derived as a simplistic model of brain cell
computation. In the model, the output of each cell is a real number $y$ representing the deviation of the
cell's firing frequency from some reference frequency. As such, $y$ can be negative as well as positive.
The inputs $\{x_1, x_2, \ldots, x_n\}$ to the cell are similarly real valued and each input $x_i$ has an associated
coupling strength $w_i$ to the cell which determines the effectiveness of that input on the cell output. The
cell determines its output by taking the weighted average of the inputs,

$$y = \frac{1}{n} \sum_{i=1}^{n} w_i x_i ,$$

where $(w_1, w_2, \ldots, w_n)$ is called the cell's 'weight-vector'. The matrix memory is constructed from a
collection of these cells, each sampling the same set of inputs. If $n_I$ is the number of inputs to the
memory and $n_O$ is the number of cells in the memory, the vector $x = (x_1, x_2, \ldots, x_{n_I})$ of inputs, when
presented to the input of the system, produces an output vector $y = (y_1, y_2, \ldots, y_{n_O})$ given by the
relation $y = \frac{1}{n_I} W x$, where $W$ is the matrix of coupling weights $w_{ji}$ connecting the $i$th input to the $j$th
cell [21, 28]. We note that each 'cell' or 'unit' is merely taking the dot-product between the
input-vector and the unit's weight-vector.
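The relation $y = \frac{1}{n_I} W x$ can be made concrete with a small sketch. The function names `cell_output` and `memory_output`, and the numerical values of the weights and inputs, are illustrative assumptions and do not come from the text; the matrix is held as a list of rows, one row per cell.

```python
def cell_output(weights, inputs):
    """One cell: dot product of its weight-vector with the inputs, scaled by 1/n."""
    n = len(inputs)
    return sum(w * x for w, x in zip(weights, inputs)) / n


def memory_output(W, x):
    """Matrix memory: y = (1/n_I) W x, one output entry per cell (row of W)."""
    return [cell_output(row, x) for row in W]


# Two cells sampling the same three inputs (values chosen only for illustration).
W = [[1.0, -1.0, 2.0],
     [0.5, 0.5, 0.5]]
x = [3.0, 0.0, -3.0]
y = memory_output(W, x)
print(y)  # [-1.0, 0.0]
```

Each output entry depends only on one row of `W`, which is the sense in which every unit computes a single dot-product.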

To store information in this system, two sets of vectors called the input prototypes
$\{f_1, f_2, \ldots, f_M\}$ and the output prototypes $\{g_1, g_2, \ldots, g_M\}$ are used. For each input prototype the
weights of the system are adjusted so that the $g_m$ vector results at the system output when $f_m$ is
presented at the input. The system is then said to associate $f_m$ with $g_m$. For each $m = 1, 2, \ldots, M$, the
matrix that is used to associate $f_m$ with $g_m$ (called the $m$th association) is the outer-product
$g_m f_m^T$ [21, p. 18]. To store the $M$ associations, these $M$ matrices are added to obtain:

$$W = \sum_{m=1}^{M} g_m f_m^T \qquad (1.1)$$

The information for each association is distributed over the whole of $W$ and therefore is overlaid with the
information for the other associations. The resulting interference between associations increases with $M$,
and ultimately limits the number of associations storable in the system.

In the case that $f_1, f_2, \ldots, f_M$ are mutually orthogonal, no interference exists. When $f_k$ is input to
the system, we have*

$$\frac{1}{n_I} W f_k = \frac{1}{n_I} \sum_{m=1}^{M} g_m f_m^T f_k = \frac{|f_k|^2}{n_I}\, g_k , \qquad k = 1, 2, \ldots, M.$$

*The symbol $|\cdot|$ here refers to the 'length' of a vector given by the Euclidean norm.

The matrix produces a multiple of $g_k$ when $f_k$ is present at the input. If the $f_k$ are chosen so that
$|f_k|^2 = n_I$ then $g_k$ is reproduced exactly [8, p. 804; 21, p. 18].
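As a minimal sketch of outer-product storage and of exact recall in the orthogonal case, the following builds the sum of outer products and applies the $1/n_I$ scaling at retrieval. The helper names `store` and `recall` and the particular prototype values are assumptions made for illustration; the input prototypes are $\pm 1$ vectors of length 4, so they are mutually orthogonal with squared length equal to $n_I$.

```python
def store(inputs, outputs):
    """Sum of outer products g_m f_m^T; W[j][i] couples input i to cell j."""
    n_in, n_out = len(inputs[0]), len(outputs[0])
    W = [[0.0] * n_in for _ in range(n_out)]
    for f_m, g_m in zip(inputs, outputs):
        for j in range(n_out):
            for i in range(n_in):
                W[j][i] += g_m[j] * f_m[i]
    return W


def recall(W, x):
    """y = (1/n_I) W x."""
    n_in = len(x)
    return [sum(w * xi for w, xi in zip(row, x)) / n_in for row in W]


# Mutually orthogonal +/-1 input prototypes, so |f_m|^2 = n_I = 4.
f = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
g = [[2.0, 0.0], [0.0, 5.0], [-1.0, 3.0]]
W = store(f, g)
for f_k, g_k in zip(f, g):
    print(recall(W, f_k), g_k)  # the two vectors agree for every k
```

With orthogonal inputs the cross terms vanish, so each stored output prototype comes back unchanged even though all three associations share the same weights.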

We will be concerned with the case that the input prototypes are not orthogonal. Noting that $f_m^T f_k$
is the dot-product $f_m \cdot f_k$, we can rewrite the product $W f_k$ as

$$W f_k = \sum_{m=1}^{M} (f_m \cdot f_k)\, g_m .$$

Now the dot-product between two vectors is a measure of how well they 'match' (assuming all vectors
have the same length). The product $W f_k$ is therefore a linear combination of the output-prototypes with
the coefficient of $g_m$ being proportional to how well $f_m$ matches $f_k$, $m = 1, 2, \ldots, M$. Since the
input-prototype that best matches $f_k$ is the vector itself, it follows that the output-prototype that has
the largest coefficient in the linear combination is the vector $g_k$. In the chapters that follow, the
prototypes will be chosen randomly in such a way that they will be very nearly orthogonal to each other.
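The coefficients in this linear combination are just dot-products, which the following sketch computes directly. The name `match_coefficients` and the three non-orthogonal $\pm 1$ prototypes are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def dot(a, b):
    """Dot-product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_coefficients(prototypes, x):
    """Coefficient of each g_m in W x, namely the dot-product f_m . x."""
    return [dot(f_m, x) for f_m in prototypes]


# Three non-orthogonal +/-1 input prototypes of equal length.
f = [[1, 1, 1, 1, 1, 1],
     [1, 1, 1, -1, -1, 1],
     [-1, 1, -1, 1, -1, 1]]

coeffs = match_coefficients(f, f[0])
print(coeffs)  # [6, 2, 0]
```

Here the prototype presented at the input contributes the coefficient 6, while the other two contribute 2 and 0; these smaller terms are the interference between stored associations.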

Therefore, the dot-products $f_k \cdot f_m$ will be small for $m = 1, 2, \ldots, M$, $m \neq k$. This means that as
long as there are not too many prototypes stored in the system, $(f_k \cdot f_k)\, g_k$ will be the dominant term in the
[...] output prototypes. We conclude that the linear-associator can be seen as a [...]
particular, it produces an output vector that is a best match to the prototype [...]
best-matches (from among all the input-prototypes) is present at the input.
[...] output vector will have contributions from other output prototypes and so is not
[...] a strict sense. When a better best-match computation is needed, a device [...]
is used.

1.2.  Auto-association 


The systems described above are called 'hetero-associators' because the 'input prototypes' are
distinct from the 'output prototypes'. That is, $f_m \neq g_m$. In fact the dimensionality of the input
prototypes may differ from the dimensionality of the output prototypes as seen above. An
'auto-associator' is similar to the hetero-associator except that the input and output dimensionalities are the
same, as are the input and output prototypes. That is, $f_m \equiv g_m$, $m = 1, 2, \ldots, M$. After the weights
are adjusted for storage of the $M$ associations, retrieval occurs when a 'damaged' input is presented to
the system. The 'damage' is due to noise in the input signal or the fact that the input may be specified
incompletely. The output that results is passed through a non-linearity [9, 40, p. 91-95, 324-325] to limit
the growth of the size of the vector components. The output will be a better rendition of the proper input
prototype provided the matrix is not overloaded (i.e. provided $M$ is not too large).

Since the output is an improved version of the input, the signal can be fed back to the input of the
system to obtain further improvement. The process is repeated several times until the vector stabilizes;
the result is generally a highly improved version of the initial input. The limitation keeps the output
vector from growing without limit and tends to force it to stabilize at or very near the proper
prototype [6, 24]. Variations of the auto-associator include the 'Hopfield net' [23, 24, 25], the
'Brain-State-in-a-Box' or 'BSB' model [0, 14], and the 'Boltzmann Machine' [22].
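The feedback process can be sketched as follows. The text does not fix the form of the non-linearity, so this sketch assumes a hard limiter that maps each component to $+1$ or $-1$ (one possible choice, in the style of a Hopfield net); the names `store_auto`, `limit` and `settle` and the three $\pm 1$ prototypes are illustrative assumptions.

```python
def store_auto(prototypes):
    """Auto-associative weights: sum of outer products p p^T."""
    n = len(prototypes[0])
    W = [[0] * n for _ in range(n)]
    for p in prototypes:
        for j in range(n):
            for i in range(n):
                W[j][i] += p[j] * p[i]
    return W


def limit(v):
    """Assumed non-linearity: hard limiter to +1 / -1."""
    return 1 if v >= 0 else -1


def settle(W, x, max_passes=10):
    """Feed the limited output back to the input until the vector stops changing."""
    for _ in range(max_passes):
        # The 1/n scaling is omitted because it does not change the sign.
        y = [limit(sum(w * xi for w, xi in zip(row, x))) for row in W]
        if y == x:
            break
        x = y
    return x


prototypes = [[1, -1, 1, -1, 1, -1, 1, -1],
              [1, 1, -1, -1, 1, 1, -1, -1],
              [1, 1, 1, 1, -1, -1, -1, -1]]
W = store_auto(prototypes)
damaged = [-1, -1, 1, -1, 1, -1, 1, -1]  # first prototype with component 0 flipped
print(settle(W, damaged) == prototypes[0])  # True
```

In this lightly loaded example one pass already restores the flipped component, and a second pass confirms that the vector has stabilized at the stored prototype.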

From the perspective of memory systems, the difference between hetero-associators and
auto-associators is that for the latter, the input signal provides direct information about the output. In the
hetero-associator, the input serves only as an 'address' or 'approximate address' from which the proper
output is to be retrieved. The auto-associator's input is both an address and a partial specification of the
proper output. In any event, the auto-associator produces an output that is the prototype that
best-matches the input vector. The algorithm degrades as the system stores more prototypes but should be an
improvement on the hetero-associator for the same storage load.

In the chapters to follow, we will often study the performance of a best-match algorithm that takes
as its input a vector produced at the output of a linear-associator. The best-match algorithm considered
in the analysis is arbitrary but could just as well be an auto-associator. The auto-associator's stored
prototypes would be identical to the linear-associator's stored output-prototypes. The analysis will be
concerned with the conditions under which the linear-associator (first stage) can produce an output vector
'recognizable' by the best-match process (second stage). The best-match algorithm will have
'recognized' the output of the linear-associator if the algorithm produces the output-prototype of the
linear-associator that corresponds to the input-prototype of the associator that is most similar to the
associator's input vector (see figure 1-1). In this configuration, the combination of the linear-associator
and the best-match algorithm forms a classifier. The linear-associator 'translates' the input vectors of a
form similar to the input prototypes into a form similar to the output-prototypes. The best-match
algorithm (possibly an auto-associator) then selects the output prototype that most corresponds to the
input to the combined system. Each input prototype corresponds to a vector that the system is most
likely to 'see' at the input or that is most representative of a class/category of input that is important to
the system. The corresponding output prototype constitutes the system response and is of a form
corresponding with the system's internal representation of the category. The combined system produces a
particular output prototype corresponding to the category to which the system input belongs. Our
concern is with the performance of the linear-associator. We will identify the conditions under which it
will produce an output vector of high enough 'fidelity' that the combined system can categorize its input.


Figure 1-1: Linear-associator and Best-Match Classifier


Proper performance in this configuration is considered a minimal requirement on the linear-associator if it
is to produce output 'signals' useful to subsequent information-processing 'stages'.
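A rough sketch of the two-stage classifier follows. The best-match stage is left arbitrary in the text; here it is assumed to pick the output prototype with the largest dot-product against the associator output, which is a fair comparison only when the output prototypes have equal length. All function names and prototype values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def store(inputs, outputs):
    """W = sum_m g_m f_m^T, built entry by entry (one row per output cell)."""
    return [[sum(g_m[j] * f_m[i] for f_m, g_m in zip(inputs, outputs))
             for i in range(len(inputs[0]))]
            for j in range(len(outputs[0]))]


def associate(W, x):
    """First stage: linear-associator output (1/n_I) W x."""
    return [dot(row, x) / len(x) for row in W]


def best_match(y, outputs):
    """Second stage (assumed form): index of the output prototype with the largest dot-product."""
    return max(range(len(outputs)), key=lambda m: dot(y, outputs[m]))


def classify(W, outputs, x):
    return best_match(associate(W, x), outputs)


f = [[1, -1, 1, -1, 1, -1, 1, -1],
     [1, 1, -1, -1, 1, 1, -1, -1],
     [1, 1, 1, 1, -1, -1, -1, -1]]
g = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
W = store(f, g)
noisy = [1, 1, -1, -1, 1, 1, -1, 1]  # f[1] with its last component flipped
print(classify(W, g, noisy))  # 1
```

The associator output for the noisy input is not equal to any stored output prototype, but it is still closest to the one paired with the most similar input prototype, which is all the second stage requires.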

1.3.  Overview  of  Major  Issues 

1.3.1. Tasks of Computational Memory

The linear-associator is an example of 'computational memory'. As opposed to local memory, which
is merely an information storage device, computational memory is characterized as an input-output device
that can respond to inputs that are not explicitly specified during storage. Similarly, the system can
produce outputs not explicitly stored. The information stored in the memory is 'overlaid' in the sense
that all items (associations) stored share a common storage medium, resulting in between-item interaction
of information. This interaction causes the output to be other than those explicitly trained to the
memory. Instead the output is a function of how similar the input is to the trained inputs, and how
similar the trained associations are to each other. This, and the fact that the memory can respond to
novel inputs, results in a memory that is capable of various 'memory tasks' during retrieval.

The most obvious (and mundane) of these is 'item memory'. For this task, the memory is treated
just as a local-storage device by storing associations $(f_m, g_m)$, $m = 1, 2, \ldots, M$ and subsequently using
$f_m$ as an 'input address' to the memory, which in turn returns information about $g_m$ as 'data'.
Another memory task is having the memory system distinguish which among the $M$ output prototypes
is the one that matches the input prototype present at the input. Specifically, one first stores the
associations $(f_m, g_{\pi(m)})$ where $\pi$ is a permutation of the $M$ indices $1, 2, \ldots, M$. One of the input
prototypes, say $f_k$, is then presented to the memory resulting in an output. This output is compared with
all the output prototypes to identify one of the latter as a best match. The memory is successful at the
task if $g_{\pi(k)}$ is the prototype chosen as the best match. This is called 'channel-memory' since the
memory acts analogously to a communication channel. Another term used is 'permutation memory',
indicating that the memory acts as a device that remembers which permutation $\pi$ of the output
prototypes was associated to the input prototypes.
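The channel-memory task can be sketched as a test of whether the stored permutation is recoverable from the memory output. The names `channel_output` and `recover_permutation`, the prototypes, and the dot-product form of the best-match comparison are assumptions for illustration. The output is written as a sum over stored pairs, which is the same quantity as the matrix product scaled by $1/n_I$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def channel_output(f, g, perm, x):
    """(1/n_I) W x for W = sum_m g[perm[m]] f[m]^T, accumulated pair by pair."""
    n_in = len(x)
    y = [0.0] * len(g[0])
    for m, f_m in enumerate(f):
        c = dot(f_m, x) / n_in
        for j, value in enumerate(g[perm[m]]):
            y[j] += c * value
    return y


def recover_permutation(f, g, perm):
    """Present each input prototype and record the best-matching output prototype."""
    recovered = []
    for f_k in f:
        y = channel_output(f, g, perm, f_k)
        recovered.append(max(range(len(g)), key=lambda j: dot(y, g[j])))
    return recovered


# Non-orthogonal input prototypes, equal-length output prototypes.
f = [[1, 1, 1, 1, 1, 1],
     [1, 1, 1, -1, -1, 1],
     [-1, 1, -1, 1, -1, 1]]
g = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
perm = [2, 0, 1]  # f[0] -> g[2], f[1] -> g[0], f[2] -> g[1]
print(recover_permutation(f, g, perm))  # [2, 0, 1]
```

The memory passes the task when the recovered list equals `perm`; with many more stored pairs the interference terms would eventually cause a mismatch.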

Though this task may seem artificial, its consideration serves two main purposes. First, proper
performance of this task is a demonstration that the memory can distinguish the associations it has
stored. If a system has stored too many associations, it may fail this task. If so, it is not providing
enough information at the output to distinguish which prototype output was 'intended' as the output of
the memory. The stipulation that the memory succeed at this task is a minimal requirement called the
'channel-criterion'. The channel-criterion is used to derive upper bounds on the number of associations
storable in the memory.

The second purpose for considering the matrix as a channel-memory is that we can then study the
system performance with regard to the task of 'input-classification'. In particular, after the system has
stored $M$ association pairs $(f_m, g_m)$, non-prototype vectors are allowed at the memory input. Assuming
that the input is most similar to the prototype $f_k$, we will call the input vector $f_k'$. To be successful
classifying $f_k'$, the matrix must generate an output that is most similar to $g_k$. This is identical to the
channel-memory task except that more freedom is allowed at the input. The classification task is
important for understanding the system's ability to respond to a vector $f_k'$ that is a partial or degraded
(say, by noise) version of the 'intended' input $f_k$. The channel-criterion again provides a means of
specifying limits on the number of associations storable in the memory for proper classification. In this
case, a tradeoff is quantified between the number of associations permitted in the memory versus how
'sloppy' $f_k'$ can be as a rendition of $f_k$. Consideration of the classification task allows one to identify
the amount of information required by a linear-associator to classify an input-vector set of a given size
into a given number of categories.

The classification task also brings up the issue of the reliability of the information at the output of
the memory as a function of the reliability of the information presented to the memory input. This
function depends on the number of associations stored in the memory. Storing more items taxes the
memory capability and so requires that more reliable information be present at the input to maintain a
given output reliability. An important issue is the determination of conditions necessary for the output
information of the memory to be more reliable than the input information. Under such conditions, the
memory could effectively supplement incomplete/degraded input information with its own stored
information to provide an output that is more complete/reliable. The memory task performed would be
that of information 'enhancement'. An associator performing this task would be valuable as a 'front
end' to later stages of associator memories or processors that required 'high-grade' information as input.

Even more intriguing is the possible use of this 'enhancement memory' to iteratively improve the
information it receives by passing the received information 'through' the memory several times. Using
two memory systems $A$ and $B$, one stores associations $(f_m, g_m)$ in $A$ and stores their inverses
$(g_m, f_m)$ in $B$. One then sends a degraded copy $f_k'$ of $f_k$ to the input of memory $A$. The output of
$A$ is then input to $B$, whose output is then fed back to the input of $A$. The process is then repeated. If
both memories are 'enhancement' devices, then the information that is passed back and forth between
them should improve with each pass through the loop. Using the theory developed here, this
possibility could be explored as a way to improve the performance of enhancement memories that have
stored a given number of associations.
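The two-memory loop can be sketched under one added assumption: each memory's output is passed through a hard limiter to $+1$ or $-1$ before being handed on. The text does not specify any output non-linearity for this configuration, so the limiter, the function names `pass_through` and `loop`, and the prototype values are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def limit(v):
    """Assumed output non-linearity: hard limiter to +1 / -1."""
    return 1 if v >= 0 else -1


def pass_through(keys, values, x):
    """Limited output of a memory that stores the pairs (keys[m], values[m])."""
    y = [0] * len(values[0])
    for k_m, v_m in zip(keys, values):
        c = dot(k_m, x)
        y = [y_j + c * v_j for y_j, v_j in zip(y, v_m)]
    return [limit(v) for v in y]


def loop(f, g, x, passes=3):
    """Pass the signal through memory A (f -> g), then memory B (g -> f), repeatedly."""
    y = None
    for _ in range(passes):
        y = pass_through(f, g, x)  # memory A
        x = pass_through(g, f, y)  # memory B
    return x, y


f = [[1, -1, 1, -1, 1, -1, 1, -1],
     [1, 1, -1, -1, 1, 1, -1, -1],
     [1, 1, 1, 1, -1, -1, -1, -1]]
g = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1]]
degraded = [-1, -1, 1, -1, 1, -1, 1, -1]  # f[0] with component 0 flipped
restored_input, restored_output = loop(f, g, degraded)
print(restored_input == f[0], restored_output == g[0])  # True True
```

In this small example the degraded input is repaired after the first trip around the loop and stays fixed afterwards; whether a loop of this kind helps in heavily loaded memories is the open question the text raises.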

A final note concerning memory tasks is that they identify modes of 'computation' that may serve
as design tools for the architecture of connectionist 'knowledge engines'.

1.3.2.  Characterization  of  Memory 

Another important consideration is the definition of the 'storage' of the memory, that is, defining
the amount of information 'contained' by the memory that is useful for retrieval. In particular, once $M$
associations are stored, we consider the matrix $f$ whose columns are the input prototype-vectors
$f_1, f_2, \ldots, f_M$ and the matrix $g$ whose columns are likewise the output-prototypes. For item memory
discussed in the last section, the storage of the memory will be defined as the information that the matrix
$f$ provides about the matrix $g$ via the memory. The question arises as to whether this is equal to the
'item-information', which is simply the sum over $m = 1, 2, \ldots, M$ of the information that $f_m$ provides
about $g_m$ via the memory. This work indicates an answer in the negative for linear-associative
item-memory, under most conditions. However, channel-memory does have this feature, again under most
conditions. A memory having this feature will be called 'item-accessible', meaning that essentially all the
information that $f$ provides about $g$ via the memory can be retrieved 'item-by-item'. Like digital
RAM memory (local storage), one can apply one input prototype at a time to the input of the memory
and record the matrix output to retrieve all the information about $g$. In fact, the information retrieved in
this way is virtually non-redundant.

Characterization of memory as item-accessible allows upper bounds to be derived for the
information retrievable from the application of a single input vector (called a single 'access'). Since the
system is symmetrically or uniformly defined over its input prototypes $f_1, f_2, \ldots, f_M$, the information
retrievable on applying any of these to the input is the same. From this it follows that the memory
storage is just $M$ times the amount of information retrievable from a single access, just as is the case for
local memory. The bounds that will be derived for the memory storage can thereby be mapped into
bounds on the amount of information retrievable for a single memory access. Even for memory that is
not item-accessible, however, the single-access bound will still hold. The difference is that the information
retrieved by applying the $M$ input vectors in sequence may 'overlap' (redundancy) and as a result will
not completely specify $g$. We will characterize memory and address these issues after basic notions of
information theory are introduced in the next chapter.

1.4.  Methods  and  Focus  of  the  Investigation 

This investigation views the asymptotic performance of the linear-associator. That is, we examine
the capabilities of the systems as they are allowed to get arbitrarily large. This will allow us to ascertain
how well their performance scales with system size. Large systems benefit from the high dimensionality of
their input/output signals and so perform better. Larger systems will therefore be most useful in
memory/classification tasks and deserve the emphasis provided in this work.

The work is confined to finding upper bounds for system performance, though an effort is made to
keep the bounds tight. Approximations are used extensively, but are accurate for the range of
parameter values considered. The approximations pertain particularly well to large-scale systems, with a
correspondingly large number of associations stored. Pushing the lower limits of system size that the
theory will accommodate, a system should have input/output dimensionalities of say 60 or 100 and at least
5000 weights. The number of associations should be at least 8 or 10 times the larger of the input/output
dimensionalities, but generally no more than the number of weights in the system. More typically
however, the input/output dimensionalities are taken to be at least several hundred each, and the number
of items stored should be at least 25,000-60,000. The number of weights should generally be twice the
number of stored associations or more.

In this work, an attempt has been made throughout to make explicit the range of applicability of
the theory. The reader is advised to note parameter-value restrictions/assumptions made in what follows.

Chapter 2

Definitions, Identities and Notation

Before the presentation of memory theory, some preliminary material must be presented concerning
the notation used and relationships that hold among the information-theoretic quantities considered. More
background concerning concepts of information theory can be found in texts [8, 12, 33].

2.1.  General  Relations  of  Information  Theory 

Unless otherwise stated, capital letters always symbolize random variables whereas lowercase letters
symbolize a specific value or random-variable outcome. Script capitals represent sample-spaces. Within
this convention, boldface unsubscripted letters represent matrices whereas boldface subscripted variables
represent vectors. The letters $W$, $F$, $G$, for instance, are random matrices; $\mathcal{W}$, $\mathcal{F}$, $\mathcal{G}$ are their respective
sample-spaces; $w$, $f$, $g$ represent respectively specific outcomes from each sample-space. Similarly
$F_m$, $G_m$ are random vectors with respective outcomes $f_m$, $g_m$. The abbreviation 'r.v.' will be
frequently used for 'random variable' and the abbreviation 'i.i.d.' will be used for 'independent,
identically-distributed' when this condition applies to a random variable. The 'equivalence sign', '$\equiv$',
will be used to denote 'equality by definition' or the equivalence of two random variables. The random
variables in this work are discrete with finite sample-spaces unless otherwise stated.

If $\mathcal{X}$ is the sample space for the r.v. $X$ and, for any $x \in \mathcal{X}$, $P(X = x)$ is the probability that
$X = x$, then the entropy of $X$, denoted $H(X)$, is defined as

$$H(X) = -\sum_{x \in \mathcal{X}} P(X = x) \log_2 P(X = x)$$

If we define $p(x) \equiv P(X = x)$ then

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) \tag{2.1}$$

Heuristically, $H(X)$ is the average, taken over all outcomes of $X$, of the minimum number of yes/no
questions required to determine the outcome of $X$ (see the sections of [8, 12, 33] relevant to Huffman coding).
We call $H(X)$ the uncertainty of $X$, the information content of $X$, or the information
represented by $X$, since it is the average amount of information required to determine $X$.
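
As a small numerical illustration of definition (2.1), the following Python sketch computes the entropy of a few distributions. The function name `entropy` and the example distributions are assumptions made for this illustration only. For the dyadic distribution (1/2, 1/4, 1/8, 1/8) the entropy is 1.75 bits, which equals the average number of yes/no questions used by an optimal questioning scheme for that distribution.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, following (2.1); outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative distributions (assumed for this sketch, not taken from the text).
fair_coin = [0.5, 0.5]
uniform_four = [0.25, 0.25, 0.25, 0.25]
dyadic = [0.5, 0.25, 0.125, 0.125]

for name, dist in [("fair coin", fair_coin),
                   ("uniform on 4 outcomes", uniform_four),
                   ("dyadic", dyadic)]:
    print(f"{name}: H = {entropy(dist):.3f} bits")
```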

When considering two random variables $X$, $Y$, the conditional entropy of $X$ given $Y$ is given by

$$H(X \mid Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(X = x, Y = y) \log_2 P(X = x \mid Y = y)$$

where $\mathcal{X}$ and $\mathcal{Y}$ are the respective sample spaces of $X$ and $Y$. This entropy can also be written

$$H(X \mid Y) = \sum_{y \in \mathcal{Y}} H(X \mid Y = y) P(Y = y)$$

where $H(X \mid Y = y) = -\sum_{x \in \mathcal{X}} P(X = x \mid Y = y) \log_2 P(X = x \mid Y = y)$.
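
The two ways of writing the conditional entropy can be compared numerically. The sketch below is only an illustration: the joint distribution and the function names are hypothetical, and the averaged form uses the conditional probabilities $P(X = x \mid Y = y)$ inside $H(X \mid Y = y)$.

```python
import math

# Hypothetical joint distribution P(X = x, Y = y), chosen only for illustration.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal_y(joint):
    """P(Y = y), obtained by summing the joint distribution over x."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y

def cond_entropy_direct(joint):
    """H(X|Y) = -sum_{x,y} P(x, y) log2 P(x | y)."""
    p_y = marginal_y(joint)
    return -sum(p * math.log2(p / p_y[y]) for (_, y), p in joint.items() if p > 0)

def cond_entropy_averaged(joint):
    """H(X|Y) = sum_y P(y) H(X | Y = y)."""
    p_y = marginal_y(joint)
    total = 0.0
    for y0, py in p_y.items():
        h_given_y0 = -sum((p / py) * math.log2(p / py)
                          for (_, y), p in joint.items() if y == y0 and p > 0)
        total += py * h_given_y0
    return total

print(cond_entropy_direct(joint), cond_entropy_averaged(joint))
```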


The definition of entropy can be extended to n-tuples of r.v.'s $\mathbf{X}_n \equiv (X_1, X_2, \ldots, X_n)$. Examination of
definition (2.1) reveals that $H(X)$ is not a function of the outcomes of $X$ but of the probability function
defined on those outcomes. In particular, $X$ in equation (2.1) could be the vector-valued r.v. $\mathbf{X}_n$ or a
matrix-valued r.v. $\mathbf{X}$. If the probability function $P_n$ is defined over the sample space $\mathcal{X}_n$ of $\mathbf{X}_n$, then
substitution of $P_n$ for $P$ in equation (2.1) gives

$$H(X_1, X_2, \ldots, X_n) = -\sum_{\mathbf{x} \in \mathcal{X}_n} P_n(X_1 = x_1, \ldots, X_n = x_n) \log_2 P_n(X_1 = x_1, \ldots, X_n = x_n)$$

Note that $\mathbf{x} \in \mathcal{X}_n$ implies that $\mathbf{x}$ is an n-dimensional vector whose $i$th component is a possible outcome
of $X_i$. If $Y_1, Y_2, \ldots, Y_m$ is an m-tuple of r.v.'s, then we can extend the definition of conditional entropy
to include $H(X_1, X_2, \ldots, X_n \mid Y_1, Y_2, \ldots, Y_m)$, which is the entropy of $X_1, X_2, \ldots, X_n$ conditioned on
$Y_1, Y_2, \ldots, Y_m$ (see [8, 12, 33]). The important relationships are

1. $$H(X_1, X_2, \ldots, X_{n-1}) \le H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i) \tag{2.2}$$

   where equality holds between the first and middle terms if and only if there is a function $f$ so
   that $X_n = f(X_1, X_2, \ldots, X_{n-1})$ with probability one. Equality holds between the second
   and third terms if and only if the $X_i$'s are mutually independent.

2. $$H(X_1, X_2, \ldots, X_n \mid Y_1, Y_2, \ldots, Y_m) \le H(X_1, X_2, \ldots, X_n \mid Y_1, Y_2, \ldots, Y_{m-1}) \tag{2.3}$$

   with equality if and only if $X_1, X_2, \ldots, X_n$ are independent of $Y_m$ whenever the outcomes of
   $Y_1, Y_2, \ldots, Y_{m-1}$ are known.

3. $$H(X_1, X_2, \ldots, X_n \mid Y_1, Y_2, \ldots, Y_m) \ge 0 \tag{2.4}$$

   with equality if and only if $X_1, X_2, \ldots, X_n$ are completely determined by $Y_1, Y_2, \ldots, Y_m$;
   that is, for each $i = 1, 2, \ldots, n$ there is a function $f_i$ such that $X_i = f_i(Y_1, Y_2, \ldots, Y_m)$
   with probability one.


Relation (2.4) holds when $m = 0$, that is

$$H(X_1, X_2, \ldots, X_n) \ge 0 \tag{2.5}$$

Particular inequalities implied by these relations are of concern, such as

$$0 \le H(X \mid Y) \le H(X) \le H(X, Y) \le H(X) + H(Y) \tag{2.6}$$

Equality holds respectively in each of the above inequalities if and only if $X = f(Y)$ with probability one;
$X$ and $Y$ are independent; $Y = f(X)$ with probability one; $X$ and $Y$ are independent. Finally, since
we are considering only discrete r.v.'s, for any deterministic function $f(x)$ we have

$$H(f(X)) \le H(X), \qquad H(f(X) \mid Y) \le H(X \mid Y) \tag{2.7}$$

$$H(f(X) \mid X) = 0 \tag{2.8}$$

$$H(Y \mid f(X)) \ge H(Y \mid X) \tag{2.9}$$

As remarked earlier, the entropy functions are functions of probability functions defined over sample
spaces. Therefore the relations above hold whether the r.v.'s that appear in the expressions are scalar,
vector, or matrix valued.

The average mutual information (or briefly 'mutual information') between $X$ and $Y$, denoted
$I(X; Y)$, can now be defined

$$I(X; Y) = H(X) - H(X \mid Y) \tag{2.10}$$

It can be shown [12] that $I(X; Y)$ is symmetric in its arguments so that $I(X; Y) = I(Y; X)$. From this
we also have

$$I(X; Y) = H(Y) - H(Y \mid X)$$

Also by equation (2.6) we have

$$I(X; Y) \ge 0 \tag{2.11}$$

with $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.
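
The properties of mutual information stated above (symmetry, non-negativity, and vanishing under independence) can be checked on small examples. The following is a minimal sketch with hypothetical joint distributions and function names. It evaluates definition (2.10) with the standard identity $H(X \mid Y) = H(X, Y) - H(Y)$, which is assumed here and not derived in the text.

```python
import math

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginals(joint):
    """Marginal distributions of X and Y from a joint table {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y), with H(X|Y) computed as H(X,Y) - H(Y)."""
    px, py = marginals(joint)
    h_x_given_y = H(joint.values()) - H(py.values())
    return H(px.values()) - h_x_given_y

# Hypothetical examples: one dependent pair and one independent pair.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(x, y): px * py
               for x, px in ((0, 0.3), (1, 0.7))
               for y, py in ((0, 0.6), (1, 0.4))}
swapped = {(y, x): p for (x, y), p in dependent.items()}  # roles of X and Y exchanged

print(mutual_information(dependent), mutual_information(swapped),
      mutual_information(independent))
```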

Consider again the yes/no-question heuristic for guessing the value of $X$. Knowledge of $Y$ is
the equivalent of being provided answers to some of the questions required to determine $X$. This
subsequently reduces the number of questions needed. The reduction is precisely the uncertainty of
$X$ before $Y$ is known minus the uncertainty of $X$ after $Y$ is known (i.e. identity (2.10)). We call this
the information $Y$ provides about $X$. By symmetry, this is also the information $X$ provides about $Y$.
As indicated in the previous paragraph, r.v.'s $X$ and $Y$ provide no information about each other if and
only if they are independent.

If $f$ is a deterministic function defined on the sample space $\mathcal{X}$ of $X$ then $H(f(X) \mid X) = 0$
and so

$$I(X; f(X)) = H(f(X)) \tag{2.12}$$

That is, the information $X$ provides about $f(X)$ is precisely the information represented by $f(X)$. For
any other r.v. $Y$, we have that $H(Y \mid f(X)) \ge H(Y \mid f(X), X) = H(Y \mid X)$, which implies
$I(Y; f(X)) = H(Y) - H(Y \mid f(X)) \le H(Y) - H(Y \mid X)$, and we have

$$I(Y; f(X)) \le I(Y; X) \tag{2.13}$$
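
Relations (2.12) and (2.13) can be illustrated with a function that merges outcomes. In the sketch below the choice of $X$ uniform on {0, 1, 2, 3}, $Y = X$, and $f(x) = \lfloor x/2 \rfloor$ is a hypothetical example, and the helper names are assumptions. Mutual information is evaluated through the equivalent form $H(A) + H(B) - H(A, B)$.

```python
import math
from collections import defaultdict

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint table {(a, b): prob}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return H(pa.values()) + H(pb.values()) - H(joint.values())

def push_forward(joint, f):
    """Joint distribution of (f(X), Y) obtained from the joint distribution of (X, Y)."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(f(x), y)] += p
    return dict(out)

# Hypothetical example: X uniform on {0, 1, 2, 3} and Y = X.
joint_xy = {(x, x): 0.25 for x in range(4)}
joint_fy = push_forward(joint_xy, lambda x: x // 2)  # f discards the low-order bit

i_y_x = mutual_information(joint_xy)    # 2 bits
i_y_fx = mutual_information(joint_fy)   # 1 bit, which is also H(f(X))
print(i_y_x, i_y_fx)
```

Applying $f$ can only discard information about $Y$, never create it, which is what (2.13) expresses.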


The concept of mutual information can be extended in ways analogous to the extensions of entropy
outlined above. Two extensions concern us. First, the information $I(X; Y, Z)$ that two r.v.'s $Y$ and $Z$
jointly provide about the r.v. $X$ is defined by considering the pair $(Y, Z)$ as a single r.v. replacing the
$Y$ term in equation (2.10)

$$I(X; Y, Z) = H(X) - H(X \mid Y, Z) \tag{2.14}$$

Second, the information $I(X; Y \mid Z)$ that $Y$ provides about $X$ when $Z$ is known is derived from the
equation for $I(X; Y)$ by conditioning the entropies in (2.10) on $Z$

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) \tag{2.15}$$

A useful relation between $I(X; Y, Z)$ and $I(X; Y \mid Z)$ is

$$I(X; Y, Z) = I(X; Y \mid Z) + I(X; Z) \tag{2.16}$$

This can readily be shown by substituting for each term above its definition as a function of entropy.
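
The substitution argument can also be checked numerically. The sketch below uses a hypothetical joint distribution over $(x, y, z)$ and expresses every conditional entropy as a difference of joint entropies, for example $H(X \mid Y, Z) = H(X, Y, Z) - H(Y, Z)$. The function names and the distribution are assumptions made for illustration.

```python
import math
from collections import defaultdict

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal_entropy(joint, keep):
    """Entropy of the coordinates listed in `keep` (0 = X, 1 = Y, 2 = Z)."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in keep)] += p
    return H(marg.values())

# Hypothetical joint distribution over (x, y, z), chosen only for illustration.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.15, (1, 1, 1): 0.10,
}

h_x = marginal_entropy(joint, [0])
h_z = marginal_entropy(joint, [2])
h_xz = marginal_entropy(joint, [0, 2])
h_yz = marginal_entropy(joint, [1, 2])
h_xyz = marginal_entropy(joint, [0, 1, 2])

i_x_yz = h_x - (h_xyz - h_yz)                    # H(X) - H(X | Y, Z)
i_x_y_given_z = (h_xz - h_z) - (h_xyz - h_yz)    # H(X | Z) - H(X | Y, Z)
i_x_z = h_x - (h_xz - h_z)                       # H(X) - H(X | Z)

print(i_x_yz, i_x_y_given_z + i_x_z)
```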

We also need a fact, used later, about joint dependence. If $W$ is a function of two r.v.'s $X$ and $Y$
jointly, it is possible that $W$ is independent of each of $X$ and $Y$ singly. That is

$$I(W; X, Y) = H(W) \tag{2.17}$$

$$I(W; X) = 0, \qquad I(W; Y) = 0 \tag{2.18}$$

An example is where $X$ and $Y$ are independent, identically distributed (i.i.d.) r.v.'s, each taking the values
$\pm 1$ with probability 1/2. If $W \equiv XY$, no information is conveyed about the
outcome of $W$ given only the outcome of $X$ or given only the outcome of $Y$.
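
The product example can be verified by enumerating the four equally likely outcomes of $(X, Y)$. This is a minimal sketch; the helper names are assumptions, and mutual information is evaluated as $H(A) + H(B) - H(A, B)$.

```python
import math
from collections import defaultdict
from itertools import product

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint table {(a, b): prob}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return H(pa.values()) + H(pb.values()) - H(joint.values())

# X and Y are i.i.d. and uniform on {-1, +1}; W = X * Y. Each (x, y) has probability 1/4.
outcomes = [(x, y, x * y) for x, y in product((-1, 1), repeat=2)]

def joint_of(select_a, select_b):
    """Joint table of two quantities derived from the outcome triple (x, y, w)."""
    table = defaultdict(float)
    for o in outcomes:
        table[(select_a(o), select_b(o))] += 0.25
    return table

i_w_x = mutual_information(joint_of(lambda o: o[2], lambda o: o[0]))
i_w_y = mutual_information(joint_of(lambda o: o[2], lambda o: o[1]))
i_w_xy = mutual_information(joint_of(lambda o: o[2], lambda o: (o[0], o[1])))

print(i_w_x, i_w_y, i_w_xy)
```

Here $H(W) = 1$ bit, so the pair $(X, Y)$ determines $W$ completely although neither variable alone says anything about it.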

2.2.  Specific  Notation  and  Relations  Required 


2.2.1. Notation for Sets and r.v. Distributions

The symbol $\mathbb{R}$ will be used in reference to the real numbers. When speaking of a sequence of $N$
entities $a_n$, $n = 1, 2, \ldots, N$, we will sometimes use the notation $\{a_n\}_{n=1}^{N}$. For infinite sequences, we
substitute '$\infty$' for $N$. Now let $\{X_n\}_{n=1}^{\infty}$ be a sequence of i.i.d. Bernoulli r.v.'s [30], taking
values $a, b \in \mathbb{R}$ with probabilities $p$ and $(1 - p)$ respectively. If $Y_n$ is the sum of the first $n$
Bernoulli r.v.'s, then $Y_n$ is a binomial r.v. [30, p. 183] and we say $Y_n$ is 'Bin(a, b, p, n)' or, more
concisely, we put $Y_n \sim \mathrm{Bin}(a, b, p, n)$. If $a = 1$, $b = -1$, and $p = 1/2$, then we put
$Y_n \sim \mathrm{Bin}(\pm 1, 1/2, n)$. Notice that in this case the variance of $Y_n$ is $n$. For a normal r.v. $X$ with
mean $\mu$ and variance $\sigma^2$, we put $X \sim N(\mu, \sigma^2)$. A normal r.v. with zero mean and unit variance is
called a standard normal r.v. and $\Phi$ denotes the standard normal distribution function. The
mean of an arbitrary r.v. $X$ is denoted by $E X$ and the variance by $\mathrm{VAR}\, X$. The term random is
used to refer to selection of an outcome of a uniform r.v. over a particular sample space. The term
reliably refers to an outcome or class of outcomes that occurs with probability near one or with
probability approaching unity as some relevant parameter gets large.

Most of the random vectors we consider will have $\pm 1$'s for components. We will call such
vectors $\pm 1$-vectors or bit-vectors since the components are binary. The set of n-dimensional
bit-vectors is sometimes denoted by $\{-1, 1\}^n$ and often referred to as a 'space' even though the set is not a
proper vector space over the real or complex numbers. If $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is a random vector
whose components $X_i$, $i = 1, 2, \ldots, n$, are i.i.d., each taking only the values $\pm 1$, then $\mathbf{X}$ is called a
Bernoulli vector. For the case that each of the two values $\pm 1$ is taken with probability 1/2, the
vector $\mathbf{X}$ is called a balanced-Bernoulli vector. Note that choosing an n-dimensional
balanced-Bernoulli vector is the same as choosing a vector at random from the n-dimensional space of bit-vectors.

2.2.2. Notations for Prototype-Vectors and the Associator Matrix

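The vectors $\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_M$ and the vectors $\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_M$ will be considered as outcomes of
random input-vectors $\mathbf{F}_1, \mathbf{F}_2, \ldots, \mathbf{F}_M$ and random output-vectors $\mathbf{G}_1, \mathbf{G}_2, \ldots, \mathbf{G}_M$ respectively. The
$\mathbf{F}_m$'s will be called input-prototypes and the $\mathbf{G}_m$'s will be called output-prototypes. These vectors
are assumed to be balanced-Bernoulli vectors with $n_i$ as the dimensionality of the input-prototypes and
$n_o$ as the dimensionality of the output-prototypes. We also form the random matrix $\mathbf{F}$ whose columns
are $\mathbf{F}_1, \mathbf{F}_2, \ldots, \mathbf{F}_M$ in index order. Similarly, we form the matrix $\mathbf{G}$ from the output-prototypes. The
symbols $\mathbf{f}$ and $\mathbf{g}$ of course denote particular matrix-valued outcomes of $\mathbf{F}$ and $\mathbf{G}$ respectively. The
storage equation (1.1) becomes

$$\mathbf{W} = \sum_{m=1}^{M} \mathbf{G}_m \mathbf{F}_m^T \tag{2.19}$$

in terms of the random prototype-vectors. This can be expressed more concisely in terms of the matrices
$\mathbf{F}$ and $\mathbf{G}$:

$$\mathbf{W} = \mathbf{G} \mathbf{F}^T \tag{2.20}$$

For retrieval, we form the matrix $\mathbf{G}'$ whose columns $\mathbf{G}'_k$ are given by

$$\mathbf{G}'_k = \mathbf{W} \mathbf{F}_k = \sum_{m=1}^{M} \mathbf{G}_m \mathbf{F}_m^T \mathbf{F}_k = \sum_{m=1}^{M} (\mathbf{F}_m \cdot \mathbf{F}_k) \mathbf{G}_m \tag{2.21}$$

or, in terms of the matrices,

$$\mathbf{G}' = \mathbf{W} \mathbf{F} = \mathbf{G} \mathbf{F}^T \mathbf{F} \tag{2.22}$$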
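
The storage and retrieval equations can be sketched with plain Python lists. The sizes, the random seed, and the function names below are illustrative assumptions. The sketch builds $\mathbf{W}$ as a sum of outer products of output- and input-prototypes and checks that retrieval with $\mathbf{F}_k$ agrees with the rightmost sum of dot-products in (2.21).

```python
import random

def random_bit_vector(n, rng):
    """One outcome of an n-dimensional balanced-Bernoulli vector."""
    return [rng.choice((-1, 1)) for _ in range(n)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def store(inputs, outputs):
    """W = sum_m g_m f_m^T, an n_o-by-n_i matrix stored as a list of rows."""
    n_o, n_i = len(outputs[0]), len(inputs[0])
    W = [[0] * n_i for _ in range(n_o)]
    for f, g in zip(inputs, outputs):
        for j in range(n_o):
            for i in range(n_i):
                W[j][i] += g[j] * f[i]
    return W

def retrieve(W, f_k):
    """g'_k = W f_k."""
    return [dot(row, f_k) for row in W]

rng = random.Random(0)        # fixed seed so the sketch is repeatable
n_i, n_o, M = 64, 16, 4       # illustrative sizes
inputs = [random_bit_vector(n_i, rng) for _ in range(M)]
outputs = [random_bit_vector(n_o, rng) for _ in range(M)]
W = store(inputs, outputs)

k = 2
direct = retrieve(W, inputs[k])
# Rightmost form of (2.21): a sum of output-prototypes weighted by dot-products.
via_dot_products = [sum(dot(inputs[m], inputs[k]) * outputs[m][j] for m in range(M))
                    for j in range(n_o)]
print(direct == via_dot_products)
```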


Another form of storage is called channel-memory or permutation memory. In this case, the
output-prototypes are considered to be known to the retrieval device (later called the detector) and therefore
will be denoted as specific outcomes $\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_M$. The input-prototypes $\mathbf{F}_1, \mathbf{F}_2, \ldots, \mathbf{F}_M$ will still be
considered as random vectors. In addition, we will have need for the r.v. $K$ whose outcome $\kappa$ is one of
the $M!$ permutations of the indices $\{1, 2, \ldots, M\}$. That is, $\kappa$ is a function that maps any
$m \in \{1, 2, \ldots, M\}$ to a unique value $\kappa(m)$ from the same set. This permutation is to be applied to
the columns $\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_M$ of the $\mathbf{g}$-matrix to produce the matrix $\kappa(\mathbf{g})$ whose columns are
$\mathbf{g}_{\kappa(1)}, \mathbf{g}_{\kappa(2)}, \ldots, \mathbf{g}_{\kappa(M)}$. When considering the outcome $\kappa$ of $K$ as undetermined, we denote by $K(m)$
the r.v. whose outcome is the value $\kappa(m)$. The random matrix that results when $K$ is applied to $\mathbf{g}$ is
denoted by $K(\mathbf{g})$. Under these conventions the storage equation for permutation storage is

$$\mathbf{W} = \sum_{m=1}^{M} \mathbf{g}_{K(m)} \mathbf{F}_m^T \tag{2.23}$$

or more concisely

$$\mathbf{W} = K(\mathbf{g}) \mathbf{F}^T \tag{2.24}$$

One says that the permutation $K$ is stored in the memory.
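
Permutation storage can be sketched in the same style. Everything named here (the sizes, the seed, `kappa`, `store`) is an assumption for illustration, and indices are 0-based. The last lines show that presenting $\mathbf{F}_k$ to the stored matrix returns $n_i \mathbf{g}_{\kappa(k)}$ plus cross-talk from the other stored pairs, which follows from (2.23) by the same algebra as for ordinary storage.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def store(inputs, outputs):
    """W = sum_m outputs[m] inputs[m]^T, stored as a list of rows."""
    return [[sum(g[j] * f[i] for f, g in zip(inputs, outputs))
             for i in range(len(inputs[0]))]
            for j in range(len(outputs[0]))]

rng = random.Random(1)          # fixed seed so the sketch is repeatable
n_i, n_o, M = 32, 8, 4          # illustrative sizes
g = [[rng.choice((-1, 1)) for _ in range(n_o)] for _ in range(M)]  # known output-prototypes
F = [[rng.choice((-1, 1)) for _ in range(n_i)] for _ in range(M)]  # random input-prototypes

kappa = list(range(M))          # one outcome of the permutation r.v. K (0-based)
rng.shuffle(kappa)
kappa_g = [g[kappa[m]] for m in range(M)]   # columns g_kappa(1), ..., g_kappa(M)

W = store(F, kappa_g)           # permutation storage, equation (2.23)

k = 0
retrieved = [dot(row, F[k]) for row in W]
# What remains after removing the n_i * g_kappa(k) term is the cross-talk.
crosstalk = [r - n_i * gj for r, gj in zip(retrieved, g[kappa[k]])]
print(kappa, crosstalk)
```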


2.3.  Probabilistic  Analysis  of  Sums 


2.3.1.  Distribution  of  Sums 


Using the rightmost sum in equation (2.21), we can write the expression for the $j$th component
$G'_{kj}$ of the random vector $\mathbf{G}'_k$:

$$G'_{kj} = \sum_{m=1}^{M} (\mathbf{F}_m \cdot \mathbf{F}_k) G_{mj} = (\mathbf{F}_k \cdot \mathbf{F}_k) G_{kj} + \sum_{m=1,\, m \ne k}^{M} (\mathbf{F}_m \cdot \mathbf{F}_k) G_{mj} = n_i G_{kj} + \sum_{m=1,\, m \ne k}^{M} (\mathbf{F}_m \cdot \mathbf{F}_k) G_{mj} \tag{2.25}$$

To extend the definition, we will need to calculate the mean, variance, and entropy of such a sum.
For this it will be useful to understand the independence of the terms under the summation.

To start, if $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ are independent n-dimensional balanced Bernoulli vectors with respective components
$X_i$, $Y_i$, $Z_i$, then the dot-products $\mathbf{X} \cdot \mathbf{Y}$ and $\mathbf{X} \cdot \mathbf{Z}$ are independent. This follows from the fact that
the products $X_i Y_i$ and $X_i Z_i$ are independent of their respective factors. In fact, this implies that
$\mathbf{X} \cdot \mathbf{Y}$ is independent of $\mathbf{X}$ when $\mathbf{Y}$ is not known, and vice versa. Since the input-prototypes are
balanced Bernoulli vectors, the dot-products $\mathbf{F}_{m'} \cdot \mathbf{F}_k$ and $\mathbf{F}_m \cdot \mathbf{F}_k$ are independent when $m' \ne m$.
Also the components of $\mathbf{G}$ are independent, so the terms $(\mathbf{F}_m \cdot \mathbf{F}_k) G_{mj}$ in (2.25) are mutually
independent.

Because of this independence, the variance of the sum is the sum of the variances of the summed
terms. Furthermore, if two r.v.'s are independent with zero mean, then the variance of the product is the
product of the variances. For each component $X_i$ of an n-dimensional balanced Bernoulli vector $\mathbf{X}$, the
mean $E X_i$ is zero and the variance is one. Therefore, if $\mathbf{Y}$ is an independent n-dimensional balanced Bernoulli
vector, the variance $\mathrm{VAR}(X_i Y_i)$ is just $(\mathrm{VAR}\, X_i)(\mathrm{VAR}\, Y_i) = 1$. From this we have the variance

$$\mathrm{VAR}(\mathbf{X} \cdot \mathbf{Y}) = \mathrm{VAR}\Big(\sum_{i=1}^{n} X_i Y_i\Big) = \sum_{i=1}^{n} \mathrm{VAR}(X_i Y_i) = n$$

From this we see that $\mathrm{VAR}(\mathbf{F}_m \cdot \mathbf{F}_k)$ is $n_i$ when $m \ne k$. Since the mean of $G_{mj}$ is zero and
the variance is one, we also have that the variance of $(\mathbf{F}_m \cdot \mathbf{F}_k) G_{mj}$ is $n_i$. These terms in the sum of
(2.25) are independent and there are $M - 1$ of them, so the variance of the sum is $(M - 1) n_i$.
Considering the mean and variance of the $n_i G_{kj}$ term as well, we find that the mean of $G'_{kj}$ is zero and
the variance is $n_i^2 + (M - 1) n_i$. The distribution of the sum on the right-hand side of (2.25) is
$\mathrm{Bin}(\pm 1, 1/2, (M - 1) n_i)$, which is roughly normal. Considering the $n_i G_{kj}$ term again, we see that it takes
the values $\pm n_i$ with equal probability. We conclude that $G'_{kj}$ is bimodal, each mode having a roughly
normal distribution. Since $M - 1 \approx M$ for large values of $M$, the variance of each mode is taken to be
$M n_i$. Methods such as this are used in the chapter on classification to determine the distribution of
sums.
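
The claim that the dot-product of two independent n-dimensional balanced Bernoulli vectors has mean zero and variance $n$ can be confirmed exactly for small $n$ by enumerating all pairs of bit-vectors. The function name and the use of exact fractions are choices made for this sketch.

```python
from fractions import Fraction
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dot_product_moments(n):
    """Exact mean and variance of X . Y over all equally likely pairs of n-dimensional bit-vectors."""
    vectors = list(product((-1, 1), repeat=n))
    values = [dot(x, y) for x in vectors for y in vectors]
    count = len(values)
    mean = Fraction(sum(values), count)
    variance = Fraction(sum(v * v for v in values), count) - mean ** 2
    return mean, variance

for n in (1, 2, 3, 4):
    mean, variance = dot_product_moments(n)
    print(n, mean, variance)
```

The enumeration grows as $4^n$, so it is only practical for small $n$; for larger dimensions a sampling estimate would be the natural substitute.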

2.3.2.  Binomial  Entropy 

Another consideration is the entropy $H(S_n)$ of a sum $S_n$ of $n$ balanced Bernoulli r.v.'s
$X_i$, $i = 1, 2, \ldots, n$. In the appendix it is shown that

$$H(S_n) \approx \tfrac{1}{2} \log_2 (\pi e n / 2) \tag{2.26}$$

Briefly, the result is obtained as follows. First define a standard Bernoulli r.v. to be a r.v. that takes
the value one with probability 1/2 and the value zero with probability 1/2. The sum $S'_n$ of $n$ standard
Bernoulli r.v.'s is binomially distributed and takes on values $s'_n$ that are in one-to-one correspondence
with the possible values $s_n$ of the sum $S_n$. To see this, note that the number $s'_n$ is the number of
summands of $S'_n$ whose value is one. When the number of 1-valued summands of $S_n$ is $s'_n$ there will
be $n - s'_n$ minus-1-valued summands of $S_n$. The value of $S_n$ will therefore be $s_n = 2 s'_n - n$. This
can also be written $s'_n = (n + s_n)/2$, completing the correspondence.

Under the one-to-one correspondence, $S_n$ and $S'_n$ have equivalent probability distributions and so
have the same entropy. Since the probability distribution of $S'_n$ is determined by the binomial
coefficients, we find the entropy of $S'_n$ to get the entropy of $S_n$. Note that $S'_n$ is binomially
distributed and so is approximately normal with variance $n/4$. One might expect that the entropy of
$S'_n$ is approximately the same as that of a normal r.v. with the same variance. Appendix A shows that
this is in fact true. That is, the entropy of $S'_n$ is roughly $\tfrac{1}{2} \log_2 (\pi e n / 2)$, where the approximation
improves as $n$ gets large. This of course implies that the entropy of $S_n$ is also roughly
$\tfrac{1}{2} \log_2 (\pi e n / 2)$.

It is useful to note that although $S_n$ is roughly normal with variance $n$, it does not have the same
entropy as a normal r.v. with the same variance. Such a normal r.v. would have entropy
$\tfrac{1}{2} \log_2 (2 \pi e n) = \tfrac{1}{2} \log_2 (\pi e n / 2) + 1$, which is 1 bit larger than the actual entropy of $S_n$. This
discrepancy is due to the fact that we can multiply a discrete r.v. by any factor, thereby changing its
variance without changing its entropy. There is no strict correspondence between the variance and the
entropy for discrete r.v.'s.
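
The quality of the approximation (2.26) can be examined by computing the binomial entropy exactly. The sketch below is illustrative and its function names are assumptions. It uses the fact that $S_n = 2k - n$ with probability $\binom{n}{k} / 2^n$, where $k$ is the number of +1 summands.

```python
import math

def binomial_sum_entropy(n):
    """Exact entropy in bits of the sum of n balanced +/-1 Bernoulli r.v.'s."""
    h = 0.0
    for k in range(n + 1):
        p = math.comb(n, k) / 2 ** n   # P(S_n = 2k - n)
        h -= p * math.log2(p)
    return h

def normal_approximation(n):
    """(1/2) log2(pi e n / 2), the approximation in (2.26)."""
    return 0.5 * math.log2(math.pi * math.e * n / 2)

def normal_entropy_same_variance(n):
    """(1/2) log2(2 pi e n): entropy expression for a normal r.v. with variance n."""
    return 0.5 * math.log2(2 * math.pi * math.e * n)

for n in (2, 10, 100):
    print(n, binomial_sum_entropy(n), normal_approximation(n),
          normal_entropy_same_variance(n))
```

For each $n$ the last column exceeds the approximation by one bit, as noted in the text.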


2.4.  Special  Functions 


An entropy function of particular interest is the binary entropy function $\mathcal{H}(p)$. Let $X$ be a r.v.
with two outcomes $x_1$ and $x_2$, with probability $p$ that $x_1$ occurs and probability $1 - p$ that $x_2$ occurs. Then

$$\mathcal{H}(p) \equiv H(X) = -p \log_2 p - (1 - p) \log_2 (1 - p), \qquad 0 < p < 1 \tag{2.27}$$

Here $\mathcal{H}(0)$ is taken to be $\lim_{p \to 0} \mathcal{H}(p) = 0$. The function is continuous over the interval $[0, 1]$ and
differentiable on $(0, 1)$.³ It is strictly increasing on $[0, 1/2]$ and strictly decreasing on $[1/2, 1]$. By taking
the Taylor series expansion of $\mathcal{H}(x)$ about $x = 1/2$ and truncating, one can get an approximation of $\mathcal{H}(x)$
for $x$ near 1/2. We also approximate $\Phi(x)$ for $x$ near 0 in the same manner. These approximations are:

$$\mathcal{H}(x) \approx 1 - (2 \log_2 e)(x - 1/2)^2, \qquad |x - 1/2| < 0.38 \text{ implies error} < 10\% \tag{2.28}$$

$$1 - \mathcal{H}(x) \approx (2 \log_2 e)(x - 1/2)^2, \qquad \text{same error as above} \tag{2.29}$$

$$\Phi(x) \approx \frac{1}{2} + \frac{x}{\sqrt{2\pi}}, \qquad |x| < 1 \tag{2.30}$$

³ For real numbers $a < b$, the open interval $(a, b)$ is the set of real numbers between $a$ and $b$ excluding the
endpoints. The closed interval $[a, b]$ includes the endpoints.
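
The approximations can be compared against the exact functions. The sketch below is illustrative, with assumed function names. It evaluates $\Phi$ through the error function, $\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x / \sqrt{2}))$, and reports the relative error of the quadratic approximation and the absolute error of the linear one at a few sample points.

```python
import math

def binary_entropy(p):
    """Binary entropy function in bits, with the limiting value 0 at p = 0 and p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def binary_entropy_quadratic(x):
    """Truncated Taylor expansion about x = 1/2, as in (2.28)."""
    return 1.0 - 2.0 * math.log2(math.e) * (x - 0.5) ** 2

def std_normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_cdf_linear(x):
    """Linear approximation about x = 0, as in (2.30)."""
    return 0.5 + x / math.sqrt(2.0 * math.pi)

for x in (0.5, 0.4, 0.2):
    exact = binary_entropy(x)
    approx = binary_entropy_quadratic(x)
    print(f"entropy at {x}: exact={exact:.4f} approx={approx:.4f} "
          f"rel.err={(approx - exact) / exact:.3%}")

for x in (0.1, 0.5, 1.0):
    exact = std_normal_cdf(x)
    approx = std_normal_cdf_linear(x)
    print(f"Phi at {x}: exact={exact:.4f} approx={approx:.4f} abs.err={approx - exact:.4f}")
```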

Measuring  Similarity 

Just as storage of information is attributed to a 'memory device', retrieval of the information is
attributed to a 'detection device' or detector. Both the memory and the detector are characterized as
mathematical processes. A particular mathematical process for the detector is that of measuring
similarity between two vectors, as is the case when the detector is a best-match process. The information
retrievable by the detector will depend upon the similarity measure employed. Therefore, the performance
of a system must be defined with respect to a particular similarity measure. We will define a first-order
similarity measure by way of the Hamming-distance function.

Definition 1: Define $\{-1, 1\}^n$ to be the set $\{x \in \mathbb{R}^n \mid x_i \in \{-1, 1\},\ i = 1, 2, \ldots, n\}$.
The Hamming distance between two vectors is the function $HD: \{-1, 1\}^n \times \{-1, 1\}^n \to \mathbb{R}$
given by $HD(x, y) = \tfrac{1}{2} \sum_{i=1}^{n} |x_i - y_i|$.

The Hamming distance is the number of components at which $x$ and $y$ disagree. Its negative is a
prototypical similarity measure on $\{-1, 1\}^n$ from which the componentwise similarity measure is defined.
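
A short sketch of Definition 1 follows; the function names are assumptions. It also checks, by enumeration for a small dimension, the elementary identity $x \cdot y = n - 2\,HD(x, y)$ for bit-vectors (agreements contribute +1 to the dot-product and disagreements contribute -1), which is why the dot-product and the Hamming distance carry the same information on $\{-1, 1\}^n$.

```python
from itertools import product

def hamming_distance(x, y):
    """HD(x, y) = (1/2) * sum_i |x_i - y_i| for vectors with components in {-1, +1}."""
    return sum(abs(a - b) for a, b in zip(x, y)) // 2

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def correspondence_holds(n):
    """Check x . y == n - 2 * HD(x, y) for every pair of n-dimensional bit-vectors."""
    vectors = list(product((-1, 1), repeat=n))
    return all(dot(x, y) == n - 2 * hamming_distance(x, y)
               for x in vectors for y in vectors)

x = (1, -1, 1, 1)
y = (1, 1, -1, 1)
print(hamming_distance(x, y), dot(x, y))   # prints: 2 0
print(correspondence_holds(4))
```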

Definition 2: Componentwise Similarity Measure. If $V$ is an n-dimensional
vector space, then a (componentwise) similarity measure is a function $S: V \times V \to \mathbb{R}$
having the following properties:

1. Symmetry: For all $x, y \in V$, we have $S(x, y) = S(y, x)$.

2. Reflexively-Maximized: For $x, y \in \{x \in V \mid |x| = 1\}$, $S(x, y)$ is maximized by
$x = y$.

3. Hamming-Consistency: For vectors $x, y, w, z \in \{-1, 1\}^n$, the Hamming-distance
inequality $-HD(x, y) < -HD(w, z)$ implies $S(x, y) < S(w, z)$.

4. First-Order Invariant: If $\kappa$ is a permutation of the indices $1, 2, \ldots, n$ and $\kappa(x)$ is
the vector whose components are the components of $x$ permuted by $\kappa$, then
$S(x, y) = S(\kappa(x), \kappa(y))$.


Under this type of similarity, $x$ and $y$ are said to be more similar than $w$, $z$ whenever
$S(x, y) > S(w, z)$. Condition 3 requires the similarity measure to be consistent with the negative
Hamming-distance similarity, $-HD(x, y)$, on $\{-1, 1\}^n$. We allow the word 'maximized' to be replaced by
'minimized' in 2 provided that the second inequality in 3 is reversed. This results in a function that is
minimal for similar vectors. The negative of a similarity function is therefore also a similarity function.

Examples of first-order similarity measures include those based on Minkowski metrics. That is, the
form $S(x, y) = \sum_{i=1}^{n} |x_i - y_i|^p$ or its negative can be used. An inner product can also be used, e.g. the
dot-product $S(x, y) = \sum_{i=1}^{n} x_i y_i$.

The notion of similarity presented here is meant to be 'distance-based'. In a vector space, two
vectors of the same length become more similar if their distance (as determined by the appropriate vector
norm) is decreased. For vectors of a fixed length, this amounts to decreasing the angle between the
directions of the two vectors. This corresponds to maximizing their dot-product. Distance-based
similarity measures, particularly the dot-product, are especially relevant to the study of the associator.
The output of the associator is based upon the similarity of the input-vector to the associator's
input-prototypes as determined by the dot-product (see equation (2.25)).

We do not discuss detection or best-match processes in this investigation, but point out that they
play a role in the considerations made in the analysis. When discussing the information that one vector
provides about another, we have assumed the information is distance-information. This characterization
of information is consistent with the dynamics of most 'neural networks'. Each cell or unit computes its
output as a function of the dot-product similarity of the input-vector and the unit's weight-vector. The
'computation' done by an associator is therefore based on similarity/distance information.

A best-match process used for detection (second-stage, as shown in 1-1) can itself be an associator or,
rather, an auto-associator, and so will base its output upon distance-information relating the (first-stage)
associator's output to the output-prototypes of the combined classifier. When speaking in later chapters
of the information that the first stage provides at its output, we will assume the information is
distance/similarity information so as to be consistent with the nature of the best-match process. We also
mention that the performance of a best-match process as a classification device will depend upon the
similarity measure it uses. When comparing vectors, such a measure must preserve all distance
information for optimal performance. We have assumed that distance information between two vectors is
completely specified by componentwise similarity. Under this assumption, the dot-product seems optimal,
at least for bit-vectors. When bit-vectors are to be compared, there is a one-to-one correspondence
between the dot-product and the Hamming distance, so that the dot-product preserves Hamming-distance
information.

Chapter 3

Information  Theory  of  Memory 

3.1. Introduction: Access vs. Aggregate Memory

In this chapter a general information-theoretic formulation of memory is presented. Storage is
characterized as the generation of a memory r.v. called the 'memory trace' from two random variables
called the address and the datum. Even if the memory trace is a deterministic function of the address and
the datum, the address and datum are r.v.'s, so the memory state they generate during storage can be
viewed as a r.v. from the point of view of retrieval. Retrieval is then the process of recovering
information about the stored datum from the retrieval-address in the presence of the memory state.
The signal configurations for both storage and retrieval are specified, allowing subsequent derivation of
information-theoretic relations and limitations. These limitations are strongly dependent upon the retrieval
strategy, which may not utilize all the information available from the memory. Retrieval methods will be
formulated and the performance of the system will be evaluated with respect to a particular retrieval strategy.

3.2.  Information-Theoretic  Characterization  of  Memory 

3.2.1. Access vs. Aggregate Retrieval

In this section we characterize memory as a configuration of r.v.'s and subsequently define memory
retrieval. We show how information is stored and retrieved as an aggregate and then how it can be
stored and retrieved as a collection of separate datum-elements. The first of these modes is referred to as
aggregate-memory and the second as access-memory. When an aggregate memory can be partitioned
into access memory, we say that it is accessible, and the storage (retrieval) of a datum-element is called a
storage-access (retrieval-access).

For accessible systems, an upper bound is found for the aggregate-information the memory can
provide, and this is then used to upper-bound the amount of information the memory can provide during a
single access (called access-information).

More explicitly, we have for aggregate memory the random variables called the storage-address
$A$ and the storage-datum $D$. These are used during storage to generate the random variable $T$
called the memory trace or simply the memory. During retrieval, the retrieval-address $A'$ is used
in conjunction with the memory trace $T$ to obtain the retrieval-datum $D'$. As a rule, the address
r.v.'s $A$ and $A'$ must share information. That is, $I(A; A') > 0$, and from this one expects that during
retrieval the memory will provide $D'$ such that $I(D; D') > 0$. As a rule, the larger the mutual
information between $A$ and $A'$ is, the larger the mutual information between $D$ and $D'$ should be.
For given r.v.'s $A$ and $A'$, the memory is optimal if $I(D; D') = H(D)$. That is, the mutual
information that the retrieval datum provides about the storage datum is maximized, so that the retrieval
datum completely specifies the storage datum.

For an aggregate memory to be accessible, it must have an address-partition. That is, there must
exist r.v.'s $A_m$, $D_m$, $A'_m$, $D'_m$, $m = 1, 2, \ldots, M$, that partition $A$, $D$, $A'$, $D'$ respectively so that
$A = (A_1, A_2, \ldots, A_M)$, $D = (D_1, D_2, \ldots, D_M)$, and similarly for $A'$, $D'$. The storage and the
retrieval processes must have partitions consistent with the address-partition. In particular, the memory
trace $T$ must be determinable from memory traces $T_m$, $m = 1, 2, \ldots, M$; each $T_m$ is generated
exclusively from $A_m$, $D_m$. Similarly, the retrieval process should be capable of generating $D'_m$ from
$T$ and $A'_m$ alone. Also we require $I(A'_m; A_m) > 0$ and expect that retrieval produces a retrieval
datum $D'_m$ such that $I(D'_m; D_m) > 0$. In many cases (though not necessarily), optimal memory
retrieval is taken to be the case in which each retrieval datum $D'_m$ completely specifies the corresponding
storage datum $D_m$.

We will make these notions more precise in the next section.

3.2.2.  Formal  Definition  of  Memory 

Storage will be viewed as the generation of a memory trace $T$ as a function of the storage
address $A$ and the storage datum $D$:

$$T = t(A, D) \tag{3.1}$$

Retrieval is the subsequent generation of the retrieval datum $D'$ as a function of the retrieval
address $A'$ and the memory trace $T$:⁴

$$D' = d'(T, A') \tag{3.2}$$

The memory is defined to be the quintuple $(A, D, A', t, d')$. Notice that the memory trace and
retrieval data are r.v.'s since they are functions of r.v.'s. The retrieval address is typically identical to the
storage address or is a 'degraded' version of it. We will generally consider the storage and retrieval
addresses to be identical. If $A$, $D$, $A'$, $D'$ and $T$ are matrices, this retrieval process is equivalent to
presenting the entire retrieval-address matrix $A'$ to the memory to obtain the retrieval-datum matrix
$D'$, which in turn provides information about the entire storage-datum matrix $D$. The
aggregate-retrievable information $I(D'; D)$ will therefore characterize the information that the memory can
provide. For a given storage function for constructing $T$, it is desirable to choose a retrieval function
determining $D'$ that maximizes $I(D'; D)$.

⁴ The memory-trace function $t(\cdot)$ and the retrieval function $d'(\cdot)$ are treated as deterministic in this development, hence the use of
lower-case letters $t$, $d'$. A more general formulation would allow the use of stochastic functions. However, the deterministic
case is pertinent to our situation and we deal with it specifically for the sake of simplicity. Note that a deterministic
function of random variables produces a random variable.
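
To make the roles of $t$ and $d'$ concrete, the following sketch instantiates them with the associator of Chapter 2. That mapping is an assumption made for illustration: the address $A$ is taken to be a list of bit-vector addresses, the datum $D$ a list of bit-vector data, $t$ a sum of outer products, and $d'$ a matrix-vector product applied to each retrieval address. The names `t` and `d_prime`, the sizes, and the seed are hypothetical.

```python
import random

def t(A, D):
    """Memory-trace function: T = sum_m d_m a_m^T (assumed associator-style storage)."""
    n_d, n_a = len(D[0]), len(A[0])
    return [[sum(d[r] * a[c] for a, d in zip(A, D)) for c in range(n_a)]
            for r in range(n_d)]

def d_prime(T, A_prime):
    """Retrieval function: one retrieved datum T a'_m for each retrieval address a'_m."""
    return [[sum(w * x for w, x in zip(row, a)) for row in T] for a in A_prime]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

rng = random.Random(0)          # fixed seed so the sketch is repeatable
M, n_a, n_d = 3, 64, 8          # illustrative sizes
A = [[rng.choice((-1, 1)) for _ in range(n_a)] for _ in range(M)]   # storage addresses
D = [[rng.choice((-1, 1)) for _ in range(n_d)] for _ in range(M)]   # storage data

T = t(A, D)                     # storage, as in (3.1)
D_retrieved = d_prime(T, A)     # retrieval with A' identical to A, as in (3.2)

print(len(T), len(T[0]), len(D_retrieved))
```

In this sketch the retrieved data are integer-valued vectors. How much they reveal about $D$ is exactly what the aggregate-retrievable information $I(D'; D)$ is meant to quantify.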

3.2.3. Partitioning Memory: Formal Definition of Access-Memory

For access storage and retrieval, one partitions the storage address A and datum D into M parts
A_1, A_2, ..., A_M and D_1, D_2, ..., D_M respectively. For our situation the A_m's will be
mutually independent and identically distributed over a common sample space, and similarly for the
D_m's. The storage process is in turn divided into M parts given by the relation

T_m = t_A(A_m, D_m),    m = 1, 2, ..., M    (3.3)

The access-storage function t_A must be chosen so that T specified in (3.1) is a symmetric function
T = t_S(T_1, T_2, ..., T_M) of the T_m's. In other words, permuting the arguments of t_S doesn't
change the value of the function determining T.

The retrieval process is similarly divided into M parts. The retrieval address A' is partitioned
into parts A'_1, A'_2, ..., A'_M and the retrieval datum D' into parts

D'_m = d'_A(T, A'_m),    m = 1, 2, ..., M    (3.4)

The access-retrieval function d'_A must be chosen so that D' specified by (3.2) is the M-tuple
D' = (D'_1, D'_2, ..., D'_M). We call the quintuple

({A_m}, {D_m}, {A'_m}, t_A, d'_A)

the access-partition of the memory. A memory that has an access partition is called access-memory.
Under the conditions stated above, the information I(D'_m ; D_m) that the m-th retrieved datum
provides about the m-th storage datum should be independent of m. This hasn't been proven here, but
the condition holds for the memory systems we are interested in. We therefore assume that
I(D'_m ; D_m), called the access-retrievable information, is independent of m. The access-memory is
said to be access-separable or separable if the r.v.'s D' and D and their respective partitions
satisfy

1.  Access-Inclusive:  I(D' ; D_m) = I(D'_m ; D_m),    m = 1, 2, ..., M    (3.5)

2.  Access-Exclusive:  I(D ; D'_m) = I(D_m ; D'_m),    m = 1, 2, ..., M    (3.6)

3.  Access-Summable:   I(D' ; D) = \sum_{m=1}^{M} I(D'_m ; D_m)    (3.7)

If additionally, the value of I(D'_m ; D_m) is the same for all m, then the memory information is
said to be uniformly access-separable or simply uniformly-separable. In this case, for fixed m

I(D' ; D) = M \cdot I(D'_m ; D_m)    (3.8)

The first of the three conditions above states that the m-th retrieval datum D'_m provides as much
information about the m-th stored datum D_m as does the entire retrieved tuple
D' = (D'_1, D'_2, ..., D'_M). The idea is that D'_m includes all the information about D_m that is
available from D'. Likewise, the second condition states that the information that D'_m provides
about D is no greater than the information that it provides about D_m. Again, the idea is that D'_m
excludes information about D_k, k \ne m. Heuristically, the first condition states that D'_m
provides all the information obtainable about D_m and the second states that it provides only
information about D_m. These two conditions would seem to imply the third, but the author has no
proof for this. The conjecture, which could be false, is left here as an open question.
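
As an illustration that is not part of the original development, the three conditions can be
checked numerically on a small example. The sketch below assumes a toy "local" memory with M = 2
independent, uniformly distributed binary storage data, each retrieved through its own binary
symmetric channel with a hypothetical flip probability p_flip. All function names are invented for
this example. The mutual informations are computed by enumerating the joint distribution of
(D, D').

```python
import itertools
import math
from collections import defaultdict


def mutual_information(joint):
    """I(X;Y) in bits from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)


def toy_joint(p_flip, M=2):
    """Joint pmf of (D, D'): M independent uniform bits, each flipped w.p. p_flip."""
    joint = {}
    for d in itertools.product((0, 1), repeat=M):
        for dp in itertools.product((0, 1), repeat=M):
            prob = 0.5 ** M
            for a, b in zip(d, dp):
                prob *= p_flip if a != b else 1.0 - p_flip
            joint[(d, dp)] = prob
    return joint


def marginal(joint, fx, fy):
    """Joint pmf of (fx(D), fy(D')) obtained from the joint pmf of (D, D')."""
    out = defaultdict(float)
    for (d, dp), p in joint.items():
        out[(fx(d), fy(dp))] += p
    return dict(out)


def separability_report(p_flip, M=2):
    """Return I(D';D) and, per m, (I(D'_m;D_m), I(D';D_m), I(D;D'_m))."""
    joint = toy_joint(p_flip, M)
    whole = lambda v: v
    total = mutual_information(marginal(joint, whole, whole))
    rows = []
    for m in range(M):
        part = lambda v, m=m: v[m]
        per_access = mutual_information(marginal(joint, part, part))
        inclusive = mutual_information(marginal(joint, part, whole))   # I(D' ; D_m)
        exclusive = mutual_information(marginal(joint, whole, part))   # I(D ; D'_m)
        rows.append((per_access, inclusive, exclusive))
    return total, rows
```

For this toy memory the three quantities in each row coincide and the total equals their sum, so
conditions (3.5) to (3.7) hold. This is expected for a memory with no interaction between stored
items and says nothing about the matrix memories considered later.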


3.3.  Characterization  of  Storage  Capacity 

3.3.1.  Bounds  on  Retrievable  Information 

We now show that when the retrieval-address A' provides no direct information about the stored
datum D, the information I(D' ; D) that the retrieval-datum D' provides about the storage-datum
D is bounded by the storage-matrix entropy. Explicitly, we show

Theorem 1: Let (A, D, A', t, d') be a memory with A' independent of D. Then

I(D' ; D) \le H(T)    (3.9)

Proof: Since D' is a function of A' and T, we have by (2.13) that
I(D' ; D) \le I(A', T ; D). By (2.18) we have

I(A', T ; D) = I(T ; D | A') + I(D ; A')
             = H(T | A') - H(T | D, A') \le H(T)

where I(D ; A') = 0 since A' is independent of D. The theorem follows.

We see from the proof of the theorem that

I(D' ; D) \le I(A', T ; D) \le H(T)    (3.10)

If A is independent of D then this relation holds for the case that A' = A. If additionally, A is
independent of T then the condition A' = A is optimal in that the second inequality of (3.10)
becomes an equality. Since this will hold for the memory systems we consider, the relation will be
displayed for future reference:

Corollary: When the conditions of theorem 1 hold for A' = A and A is independent of T we have

I(D' ; D) \le I(T, A ; D) = H(T)    (3.11)

We now have a bound for the aggregate-retrievable information. If the memory is uniformly
separable, then we will have a bound on the information retrievable on each access.

3.3.2.  Storage  and  Storage  Capacity 

To obtain a bound on the information retrievable on the m-th access, assume that the memory
(A, D, A', t, d') is uniformly separable. We then have for any m = 1, 2, ..., M:

M \cdot I(D'_m ; D_m) = I(D' ; D) \le H(T)    (3.12)

so that

I(D'_m ; D_m) \le H(T)/M    (3.13)

We will call this the uniform-access bound.

The uniform-access bound motivates the definition of storage and storage capacity for uniformly
separable memory. For the systems we will consider, A' = A is optimal in the sense mentioned in the
previous section. We assume then that the retrieval address is identical to the storage address and
suppose that I(D'_m ; D_m) is independent of the index m but is a function I(M) of the number M of
items stored. From (3.12), I(M) must satisfy

M \cdot I(M) \le H(T)    (3.14)

The product on the left is the information storage of the system. The storage capacity will be
defined as

C_S = \max_M M \cdot I(M)    (3.15)

There are two ways to obtain a maximum of the number M of storable items. The first assumes that
the product M \cdot I(M) increases to a maximum as M increases to a value M*, then decreases. In
this case equation (3.15) implies

C_S = M* \cdot I(M*)    (3.16)

where the right-hand side is bounded above by the entropy H(T) evaluated at M*, which we denote
H(T, M*). If I(M*) can be determined, then by (3.15)

M* \le \max_M H(T, M) / I(M*)    (3.17)

Another bound for M utilizes a lower bound L(M) for I(D'_m ; D_m) as a criterion for system
performance. Specifically, we make the constraint that

L(M) \le I(M)    (3.18)

as a requirement for minimal system performance. If L(M) is smaller than I(M) for small values of M
but overtakes I(M) as M grows, a bound for M can be obtained from the constraint.

For the case that the memory is not separable, it may still be uniform in the sense that
I(D'_m ; D_m) is independent of m \in {1, 2, ..., M}. For the instances we consider, relations
(3.12) and (3.13) still hold, so the methods of bounding M explained above apply. These methods
will be utilized in the next chapter.
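
The two bounding methods can be sketched numerically. The code below is only an illustration under
stated assumptions: the per-item information curve example_info (an exponential decay with
hypothetical parameters i0 and m0) and the performance criterion example_criterion (at least
\log_2 M bits per access) are invented for the example and are not derived in this work.

```python
import math


def storage_capacity(info_per_item, m_max):
    """Return (M*, C_S) maximizing M * I(M) over M = 1..m_max, as in (3.15)-(3.16)."""
    best_m = max(range(1, m_max + 1), key=lambda m: m * info_per_item(m))
    return best_m, best_m * info_per_item(best_m)


def largest_m_meeting_criterion(info_per_item, lower_bound, m_max):
    """Largest M in 1..m_max satisfying L(M) <= I(M), the constraint (3.18)."""
    ok = [m for m in range(1, m_max + 1) if lower_bound(m) <= info_per_item(m)]
    return max(ok) if ok else None


def example_info(m, i0=8.0, m0=50.0):
    """Hypothetical I(M): per-item information decaying exponentially with load."""
    return i0 * math.exp(-m / m0)


def example_criterion(m):
    """Hypothetical L(M): require at least log2(M) bits per access."""
    return math.log2(m)


m_star, c_s = storage_capacity(example_info, 500)
m_limit = largest_m_meeting_criterion(example_info, example_criterion, 500)
```

With the assumed curve, the product M \cdot I(M) peaks at M = m0. The criterion-based limit is the
point where the rising requirement L(M) overtakes the falling I(M).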


3.4.  Relation  of  Separability  of  Memory  to  Performance 

3.4.1.  Non-Separability  of  Distributed  Memory

For associative item-memory, we make the identification A, A' = F, D = G, T = W and D' = G'.
Aggregate storage is then given by (2.20) and aggregate retrieval by (2.22). The access-partition
of the address and datum is just the division of the matrices into columns corresponding to the
prototype vectors. The input-prototypes partition the address F, each acting as a separate
"address word", and the output-prototypes partition the stored-datum G, acting as individual
"datum words". The datum G_m is said to be stored at "location" F_m. Access-storage is specified
by (2.10) and access-retrieval is given by (2.21).

From calculations done outside this investigation, the linear-associator as an item memory is
conjectured not to be separable except in limited cases. A preliminary development by the author
has determined that item memory might be access-inclusive when M < n_i/5. Further, it may actually
be separable when n_i/5 > M > exp2(n_o). These are submitted as sufficient conditions for
separability but may not be necessary. A memory with an input-dimensionality exceeding 2M and an
output-dimensionality a few times \log_2 M might be separable. Such a configuration is consistent
with those considered later in the chapter on classification. For classification, systems with
input-dimensionality greatly exceeding the output-dimensionality are most efficiently suited to the
task.

On the other hand, separable memory is identical in function to digital RAM or local memory. The
fact that matrix-based memories distribute the information for each association over the entire
matrix means that the information for each association is overlaid with that of the others. This
feature is what allows the information for separate associations to interact. Regularities in the
input-to-output mappings specified by many associations should be "amplified" whereas
irregularities/inconsistencies would be attenuated in the memory's input-to-output map. This
interaction is contrary to the notion of separability. In fact, non-separability is the very
feature that constitutes the capacity of distributed memory for "pattern discovery" [6, 40, ch. 1]
and other functions that make them of computational interest. The non-separability of these systems
makes their storage capacity more difficult to ascertain. However, the property "super-summable"
exists for these systems so that bounds on the per-item retrieval-information can be found in terms
of the entropy of the matrix.5 This results in a bound on the number of items storable in the
system with respect to a minimal performance criterion.

3.4.2.  Super-Summability  of  Item  Memory

Assuming that item-memory is not separable, it may not be summable. However, the independence of
the entries G_{kj} of the G matrix ensures that the memory is super-summable. That is

I(G' ; G) \ge \sum_{k=1}^{M} I(G'_k ; G_k)    (3.19)

As we will see, this relation is quite useful in subsequent chapters on storage and classification.
For the sake of later analysis then, we will start by showing that this inequality and a useful
extension of it hold. To start, H(G) = \sum_{k=1}^{M} H(G_k) since the G_k's are independent. Also,
since G = (G_1, G_2, ..., G_M) and G' = (G'_1, G'_2, ..., G'_M), we have that

H(G | G') \le \sum_{k=1}^{M} H(G_k | G') \le \sum_{k=1}^{M} H(G_k | G'_k)

always holds. Combining these, we get

I(G' ; G) = H(G) - H(G | G')
          = \sum_{k=1}^{M} H(G_k) - H(G | G')
          \ge \sum_{k=1}^{M} ( H(G_k) - H(G_k | G') )
          \ge \sum_{k=1}^{M} ( H(G_k) - H(G_k | G'_k) ) = \sum_{k=1}^{M} I(G'_k ; G_k)

so that (3.19) holds. The extension of this is

I(G' ; G) \ge \sum_{k=1}^{M} \sum_{j=1}^{n_o} I(G'_{kj} ; G_{kj})    (3.20)

which is proven in a similar manner by showing

I(G'_k ; G_k) \ge \sum_{j=1}^{n_o} I(G'_{kj} ; G_{kj})    (3.21)

which holds because the components of G_k are independent.

5 The term 'super-summable' is coined in analogy to the term 'sub-summable' used by mathematicians
to describe non-linear functions p(x) that obey p(x + y) \le p(x) + p(y). For our purposes, a
'super-summable' function would have the inequality reversed.

The relations (3.19) and (3.20) are useful because I(G' ; G) is bounded above by H(W) and so we
have both

\sum_{k=1}^{M} I(G'_k ; G_k) \le H(W)    (3.22)

and

\sum_{k=1}^{M} \sum_{j=1}^{n_o} I(G'_{kj} ; G_{kj}) \le H(W)    (3.23)

Additionally, if the memory is uniform so that I(G'_k ; G_k) is the same for all k, and
I(G'_{kj} ; G_{kj}) is the same for all k, j, then (3.22) and (3.23) become

I(G'_k ; G_k) \le H(W)/M,    k = 1, 2, ..., M    (3.24)

I(G'_{kj} ; G_{kj}) \le H(W)/(M n_o),    k = 1, 2, ..., M,  j = 1, 2, ..., n_o    (3.25)

Thus we get a bound on the information provided by any access-retrieval-data G'_k about the
storage-data G_k and also a bound on the amount of information any of the access-retrieval
components G'_{kj} provide about the storage components G_{kj}.

These arguments hold when G' is replaced by some componentwise function G'' = g''(G'), or rather
G''_{kj} = g''(G'_{kj}), as the retrieval function. The inequalities will be shown here for future
reference:

I(G''_k ; G_k) \le H(W)/M    (3.26)

I(G''_{kj} ; G_{kj}) \le H(W)/(M n_o)    (3.27)

These bounds will be useful in later chapters on storage and classification.

3.4.3.  Separability  of  Permutation  Memory

For permutation memory, the storage address is the matrix F = (F_1, F_2, ..., F_M). The g-matrix
in this equation is known to the detector and so is shown as a constant rather than a r.v. matrix.
The storage-datum, D, is a permutation r.v. K whose outcome \kappa is one of the M! permutations of
the indices {1, 2, ..., M}. That is, \kappa is a function that matches a given value m in
{1, 2, ..., M} with a unique value \kappa(m) from the same set. To store the datum K, one applies K
to the columns g_1, g_2, ..., g_M of the matrix g to get the matrix K(g), whose columns are
g_{K(1)}, g_{K(2)}, ..., g_{K(M)}. The storage r.v. matrix is then obtained from F and K as in
equation (2.24). The retrieval address F' is a matrix r.v. with I(F' ; F) > 0. Often, we will take
F' to be F. The retrieval-datum, K', is a r.v. whose outcome \kappa' is determined as follows:

1.  For m = 1, 2, ..., M, compute the vector G'_m = W F'_m and select via a similarity
    measure the vector g_k from among the output-prototypes that is a best-match of
    G'_m. (In the case there is more than one such best-match, select one of them at
    random.)

2.  Set K'(m) = k.

This process represents the aggregate-retrieval function d'. The access partition is the quintuple
({F_m}, {K(m)}, {F'_m}, t_A, d'_A), where t_A is given by t_A(F_m, K(m)) = g_{K(m)} F_m^T and the
access-retrieval function d'_A is calculated as shown in the two steps above for only one value of
m at a time.

For storage of a permutation \kappa chosen randomly, the values \kappa(1), \kappa(2), ...,
\kappa(M) are nearly independent for large M. The only restriction on the \kappa(m)'s is that
\kappa(m') \ne \kappa(m) when m' \ne m. For large M, this restriction introduces little dependence
among the values of \kappa(m), m = 1, 2, ..., M. Since these M values are nearly independent, their
joint entropy is approximately the sum of their individual entropies. The individual entropy is
\log_2 M bits, so the joint entropy is roughly M \log_2 M bits. More precisely, the joint entropy
is \log_2 M! bits since the values \kappa(m) specify one of M! permutations. But \log_2 M! is
roughly M \log_2 M for large M (say for M > 3000). Taking the values \kappa(m), m = 1, 2, ..., M
to be independent is therefore a good approximation.
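
The quality of this approximation is easy to check numerically. The sketch below (function names
are illustrative only) compares \log_2 M! with M \log_2 M using the log-gamma function. By
Stirling's formula the ratio behaves like 1 - 1/\ln M, so it approaches one slowly. At M = 3000 it
is roughly 0.88.

```python
import math


def log2_factorial(m):
    """log2(M!) computed through the log-gamma function to avoid huge integers."""
    return math.lgamma(m + 1) / math.log(2)


def independence_ratio(m):
    """Ratio of the joint entropy log2(M!) to the independent-value estimate M*log2(M)."""
    return log2_factorial(m) / (m * math.log2(m))


ratios = {m: independence_ratio(m) for m in (10, 100, 3000, 10**6)}
```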

In the same way, retrieval of K'(m) always gives some information about K'(l) for l \ne m. This
is because if the memory is accurate, then K'(m) = K(m) with probability near one. Therefore, since
K(l) \ne K(m), the value of K'(l) is not equal to K'(m), again with probability near one. In short,
knowing the value of K'(m) gives "cross-over" information about K'(l), l \ne m. In particular, the
value of K'(l) will probably not be the one observed for K'(m). For accurate memory, we can compute
this cross-over information:

I(K'(m) ; K'(l)) = H(K'(l)) - H(K'(l) | K'(m))
                 \approx \log_2 M - \log_2 (M - 1) \approx (\log_2 e)/M

This is negligible compared to the uncertainty of K'(l) for large M.

Due to the symmetry of the memory and retrieval functions (the F_m's are i.i.d.), the probability
P[K'(m) \ne K(m)] is independent of m. Letting P_e be this probability, we seek the information
I(K'(m) ; K(m)). To do so, we note that a best-match process that produces K'(m) as its output acts
probabilistically as an M-ary symmetric communications channel [12] with K(m) as the item to be
"transmitted" and K'(m) as the item produced at the "receiving end". We also have P_e as the
probability of error at the receiving end. From this it follows that the information that the
output provides about the input is given by

I(K'(m) ; K(m)) = \log_2 M - P_e \log_2 (M - 1) - H(P_e)
                \approx (1 - P_e) \log_2 M - H(P_e)    (3.28)

where H(P_e) is the binary entropy of P_e. This is the information that the received signal
provides about a transmitted signal that was sent over the communication channel. For small P_e,
I(K'(m) ; K(m)) is approximately \log_2 M. On the other hand

\log_2 M \approx I(K'(m) ; K(m)) \le I(K' ; K(m)) \le H(K(m)) = \log_2 M

so that I(K' ; K(m)) \approx I(K'(m) ; K(m)), and the memory is access-inclusive.
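
The channel expression in (3.28) can be evaluated directly. The following sketch assumes, as in
the text, a uniformly distributed input over M symbols and an M-ary symmetric channel with error
probability P_e. The function names are illustrative.

```python
import math


def binary_entropy(p):
    """H(p) in bits, with the convention H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mary_symmetric_information(M, p_e):
    """I = log2(M) - P_e*log2(M-1) - H(P_e) for a uniform input over M >= 2 symbols."""
    if M < 2:
        raise ValueError("M must be at least 2")
    return math.log2(M) - p_e * math.log2(M - 1) - binary_entropy(p_e)


# Information per access for M = 1024 items at a few error probabilities.
table = {p: mary_symmetric_information(1024, p) for p in (0.0, 0.01, 0.1, 0.5)}
```

At P_e = 0 the full \log_2 M bits are recovered. The information falls to zero when
P_e = (M - 1)/M, that is, when the best-match output is no better than a uniform guess.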

To show that the memory is access-exclusive, the argument is similar. Assuming P_e is small,
knowledge of either K(m) or of K tells us with high probability what K'(m) will be (namely the same
value as K(m)). We have

I(K'(m) ; K) \approx H(K(m))  and  I(K'(m) ; K(m)) \approx H(K(m))

so I(K'(m) ; K) \approx I(K'(m) ; K(m)).

To show the memory is access-summable, we retain the assumption that P_e is small so that K
and K' will be identical with near-unity probability. This gives the relation

I(K' ; K) \approx H(K) = \log_2 M! \approx M \log_2 M

As mentioned earlier, I(K'(m) ; K(m)) \approx \log_2 M, so

I(K' ; K) \approx \sum_{m=1}^{M} I(K'(m) ; K(m))

We have shown that the memory is access-separable. Uniformity follows from the fact that
I(K'(m) ; K(m)) \approx \log_2 M for all m = 1, 2, ..., M. In the low-error case then, the memory
is uniformly separable. The question regarding how separable the memory is for larger error is a
subject open for further investigation. Since P_e is independent of m, uniformity should hold even
in the case that P_e is large. The author's conjecture is that greater error will degrade
separability gradually and perhaps negligibly provided that (1 - P_e) \log_2 M > H(P_e).


3.4.4.  Relation  of  Performance,  Item-Memory  and  Channel-Memory 

The notion of permutation-memory is merely a formulation of the memory's ability to keep track of
which input-prototype is mapped to which output-prototype. For fixed outcomes f_m and g_m,
m = 1, 2, ..., M of the prototypes and two random permutations, K and K', a matrix storing the
associations (f_m, g_{K(m)}) should be different from the matrix storing the associations
(f_m, g_{K'(m)}). The difference should be reflected in the response of the two matrices to a given
input. For associative memory, the input will be some prototype f_k. For the associative-classifier,
the input will be some bit-vector f'_k that is closer to f_k than it is to the other prototypes. For
either case, the matrix-output, call it g'_k, should reflect which output-prototype, g_{K(k)} or
g_{K'(k)}, was associated with f_k. If (f_k, g_{K(k)}) is stored, then g'_k should be closer to
g_{K(k)} than to the other output-prototypes. Likewise for the case that (f_k, g_{K'(k)}) is stored.
In either case, the matrix-output should provide an outside observer (a detector/best-match-process
that has access to the output-prototypes) enough information to decipher which output-prototype is
matched up with f_k within the associator. In effect, the matrix-output must provide enough
information about the proper output-prototype (e.g. g_{K(k)} for the first matrix and g_{K'(k)} for
the second) to distinguish it from among the M alternatives. Of course, the permutation used is
imaginary in the sense that we can relabel the output-prototypes so that the matrix is seen to store
the associations (f_m, g_m). With this convention, the output g'_k should provide the detector with
enough information, that is, \log_2 M bits, to allow a detector to decide which output-prototype is
'g_k'.


In terms of the random vectors, G'_k has a mean determined by G_k but is independent of the
individual prototypes G_m, m \ne k, and so G'_k provides no information about any individual G_m.
The information that G'_k provides about the output-prototypes to discern G_k from among the M
alternative prototypes should be largely due to the information it shares with G_k. This must be at
least \log_2 M bits so

I(G'_k ; G_k) \ge \log_2 M    (3.29)

would seem to be the necessary constraint on item-memory.

The problem is that G'_k may not be independent of the set {G_m | m = 1, 2, ..., M, m \ne k}
as a whole, especially when G_k is known. Therefore the information it provides about the 'correct
choice' among the prototypes may be dispersed among all prototypes. The author has no precise
formulation for this problem other than the definition of access-separability mentioned earlier.
With access-separable memory, the information that G'_k provides about the output-prototypes is
exactly the information it provides about G_k, so that (3.29) would be a natural consequence of the
present discussion.

Although item-memory appears not to be separable, our dilemma is resolved by the following
observations. First, since

I(G'_k ; G_1, G_2, ..., G_M) \ge I(G'_k ; G_k)

the constraint (3.29) will assure that the left-hand member of the above relation is at least
\log_2 M. Another consideration is the detector itself. We assume that it merely compares G'_k with
each of the prototypes individually, and then compares the M results. No calculation involving G'_k
with more than one prototype at a time is allowed. A detector of this sort should only be sensitive
to information G'_k provides about individual prototypes. This information is zero for all
prototypes except G_k. Condition (3.29) will therefore be necessary for the detector. Of course, a
more sophisticated detector, which may not require this condition for reliable performance, may
perform better than indicated in the subsequent chapters.

Chapter  4

Evaluation  of  Information-Storage  Capacity 

The analysis to follow is concerned with the case that the number, M, of stored associations is
larger than the input dimensionality, n_i, so that the input vectors are linearly dependent and
interference effects must be accounted for. In this case the output vector is only an approximation
of the proper prototype output. Our concern is the number M of associations that can be stored in a
matrix of a given size before the output becomes unrecognizable.


4.1.  Characterizing  Storage  Capacity 

To estimate the storage capacity of the matrix, we examine a system that has stored M
associations (f_m, g_m), m = 1, 2, ..., M for some M. The input-prototype vectors are
n_i-dimensional and the output-prototypes are n_o-dimensional. For simplicity of analysis the
prototypes will be balanced Bernoulli-vectors (see chapter 2, p. 15). All input-prototypes will
then have |f_m|^2 = n_i and all output-prototypes will have |g_m|^2 = n_o.

To motivate the method of storage measurement, we make an analogy with digital memory. The address
to the digital memory can be viewed as an input vector and the retrieved data as the output vector.
A particular address vector and the data vector stored at the address location can be regarded as a
vector-association pair. The number of bits represented by the data vector is the information the
system provides upon performing the input-to-output association. For digital memory, the number of
bits represented is the same as the number of bit-locations in the data vector and so is identical
with the dimensionality of the data vector. Storage is defined in this chapter as the amount of
information per association multiplied by the number of associations stored in memory. Storage
capacity is the maximum storage the system can provide. In this case, the storage capacity is
limited by the number of storage locations of the memory. Though the dimensionality of both the
input and output vectors is specified in advance, the data items are not. That is, the number of
items that can be stored is not determined by what they are. In effect, being able to retrieve data
from the memory has no meaning unless we are able to store an arbitrary data set at the outset (ROM
is no exception, when we consider all memory configurations possible before burn-in). In essence,
the question 'What is the storage-capacity of the memory?' has no meaning when one is considering a
specific device whose identity and input-to-output mapping is already determined/unchangeable. A
burned-in ROM, for instance, is no longer a storage device, merely a retrieval device.

For the matrix memory, the storage is likewise given by the information-per-association multiplied
by the number of associations stored. The dimensionalities of the input and output prototypes are
specified in advance, but the prototypes themselves are not. That is, we cannot assume specific
values for the prototypes in the analysis to determine the storage capability of the system. Since
the prototypes to be stored determine the values of the weights of the memory-matrix, the matrix is
itself unknown. For this reason, the storage of the memory is not defined for a particular matrix
but rather for a class of matrices all of the same size.9 The class of outer-product
matrix-associators of a given size is the set of all matrices that can be generated from
balanced-Bernoulli vectors via equation (1.1). The discussion above indicates that an association
is not considered to be stored in a particular matrix of the class unless it is explicitly included
in the sum (1.1) that determines the matrix.

The information-per-association for matrix memory can be characterized in several ways, two of
which are considered here. The first, called item-memory, chooses an arbitrary
k \in {1, 2, ..., M} and presents the k-th input prototype to the system. The matrix-output is then
regarded as a probabilistic rendition of the k-th output prototype. On the average (over all
matrices of the class), given M, the matrix-output will provide a certain amount of information
about the prototype output and this is taken as the information provided by the association.

The second method, channel-memory or permutation-memory, acts analogously to an information
channel. The k-th input is presented to the system and an output is generated. The latter is
compared with each prototype-output vector via a similarity measure and the best match from the
prototypes is chosen. To perform correctly, the system is expected to produce the k-th output
prototype as the best-match. If the l-th output prototype is chosen, an error is identified with
l \ne k. The probability of error averaged over the matrix-class is taken as the error probability
for the associator as an M-ary symmetric channel (see section 3.4.3). The average mutual
information between the output and input is thus defined. This average is considered as the
information per association. For channel memory, we define for each pair of positive integers
(N, M) the matrix channel of size N on M associations. It consists of the ensemble of all possible
matrices with n_i n_o = N that can be constructed from a set of M balanced-Bernoulli-vector
prototype-pairs (f_m, g_m), m = 1, 2, ..., M. Mathematically the ensemble acts as a
communication-channel of information theory. Once a particular set of associations is chosen for
storage, a particular matrix is selected from the ensemble via equation (1.1). This matrix is
deterministic and is not itself a communication channel, and its storage is not defined.

9 In fact, Hinton (personal communication) observed that an n by n identity matrix seems to have an
exponential amount of 'storage' since 2^n vector-pairs seem to be 'stored'. That is, using
n-dimensional vectors of ±1's, one selects one from among the 2^n possible. This vector is placed
at the input of the system to retrieve the same vector at the output. More generally however, this
can be done with an arbitrary matrix. Simply select a vector (address) of ±1's, present it at the
input, 'digitize' the output into ±1's and say that the resulting vector (data) is the one 'stored'
at that address. This would give all matrices exponential retrieval, but there is no storage
process that allows one to specify which addresses are to be known by the matrix and what datum is
stored at each address. This illustrates that storage and retrieval are not to be confused as being
the same. On the other hand, they are not independent of each other either. Reliable retrieval of a
stored association or 'item' will require, for the associator at least, that less than an
exponential number of items be specified during the storage process.

For both item and channel memory, the storage is the product of $M$ and the information $I$
represented by a single association. Initially, the storage $MI$ of the matrix increases proportionally with
$M$. However, the error probability increases with $M$ as well, so that the information per association $I$
gradually decreases. For some value $M^*$ of $M$, the information per association begins to diminish more
rapidly than $M$ increases. At this point, storing more associations decreases the total information storage
of the system. For $M = M^*$, the system has reached its storage capacity.

The fact that the total retrievable information decreases eventually as $M$ gets large is not proven
in this work. In fact, this may not be the case. On the other hand, the channel memory provides a
minimal criterion for memory performance. To perform well as a channel, a system need only produce an
output that is more similar to the appropriate output-prototype than to the others. In effect, this
demands only that the system be able to tell the stored associations apart. This seems a natural minimal
capability since item-memory by contrast demands that the matrix actually 'reconstruct' the appropriate
output prototype. A system that can do this, even with low fidelity of reproduction, can still perform well
as a channel. The channel memory defines a lower limit allowable for the fidelity. Since fidelity
deteriorates as more items are stored, we obtain a maximum number of useful associations that can be
stored by the system. Channel memory then is crucial in determining the 'absolute maximum' number of
associations to be stored in a system.


4.2.  Bounds  on  Storage  Capacity 


4.2.1.  Restrictions  on  Relative  Magnitudes  of  Parameters

The analysis that follows assumes important restrictions on the magnitudes and relative sizes of the
parameters. These restrictions are in addition to any others derived later in this chapter.

We begin with the requirement that the input prototypes and the output prototypes be distinct
vectors. With this, the number $M$ of prototype-pairs must satisfy $\lceil \log_2 M \rceil \le n_i$ and
$\lceil \log_2 M \rceil \le n_o$. However, if each of these relations is an equality, the prototypes are already
determined. The only thing that can vary is which input prototype is paired to which output prototype.
There are $M!$ ways to form the prototype pairs and so $M!$ ways to form the matrix. Therefore the
matrix entropy is $\log_2 M! \approx M \log_2 M$ bits, which is somewhat less than we will find it to be when the
prototypes are randomly selected. The 'entropy-degradation' caused by a fixed prototype-set would
seriously limit the amount of information the matrix can provide at its output.

In order to ensure that the matrix entropy is not compromised, we must be able to choose either the
input prototypes or the output prototypes (or both) at random. If the randomly chosen input-prototypes
are to be distinct with high probability, we must have $2 \log_2 M < n_i$, and if the output-prototypes are to
be randomly chosen, we need $2 \log_2 M < n_o$. These requirements ensure that sampling without
replacement is virtually identical to sampling with replacement so that no duplicate selections occur. If at
least one of these two requirements is met, the matrix-entropy should not be degraded.
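The distinctness requirement can be explored numerically. The following is a minimal sketch, assuming the prototypes are drawn independently and uniformly from the $2^n$ possible ±1 vectors; the function name `prob_all_distinct` is illustrative and not part of the thesis. It evaluates the exact probability that $M$ random draws contain no duplicate, which climbs toward 1 as $n$ grows past $2 \log_2 M$.

```python
import math

def prob_all_distinct(M: int, n: int) -> float:
    """Probability that M vectors drawn independently and uniformly from the
    2**n possible +/-1 vectors of dimension n are all distinct."""
    space = 2.0 ** n
    if M > space:
        return 0.0  # more draws than vectors: a duplicate is unavoidable
    # log of prod_{k=0}^{M-1} (1 - k / 2**n), accumulated in log space
    log_p = sum(math.log1p(-k / space) for k in range(M))
    return math.exp(log_p)

if __name__ == "__main__":
    M = 1000
    print("2*log2(M) =", round(2 * math.log2(M), 2))
    for n in (10, 20, 30, 40):
        print(n, prob_all_distinct(M, n))
```

Near $n = 2 \log_2 M$ duplicates are still fairly likely in this sketch, so in practice one would presumably choose $n$ comfortably above that threshold.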

More stringent requirements are needed if the prototype vectors are to be dissimilar to each other.
This requirement is necessary for the output prototypes if a best-match algorithm is to match the output
of the linear-associator with the correct output-prototype. A few errors in the matrix output should not
confuse the best-match process as they would if the prototypes are too similar. The requirement is also
necessary for the input-prototypes when the linear-associator is doing classification (see next chapter) and
the inputs to the matrix are expected to be similar but not identical to an input-prototype. To meet the
requirement, the dimensionality of a vector space from which prototypes are to be chosen cannot be too
small. If two balanced-Bernoulli vectors are chosen from an $n$-dimensional space then the number of
components that are identical between the two has average $n/2$ and standard deviation $\sqrt{n}/2$. Since
agreement of exactly $n/2$ components corresponds to orthogonality and most vectors will fall within 2 or
3 standard deviations of the mean, the prototypes will be highly orthogonal if the mean is large compared
to the standard deviation. For this, $n$ should be at least 100 or so.
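The mean and spread quoted above can be checked exactly. The sketch below assumes the agreement count between two independent balanced ±1 vectors is Binomial($n$, 1/2); the helper names `agreement_stats` and `fraction_within` are illustrative only.

```python
import math

def agreement_stats(n: int):
    """Exact mean and standard deviation of the number of agreeing components
    between two independent balanced +/-1 vectors of dimension n.
    The count is Binomial(n, 1/2)."""
    pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, math.sqrt(var)

def fraction_within(n: int, num_std: float) -> float:
    """Probability that the agreement count lies within num_std standard
    deviations (each sqrt(n)/2) of the orthogonal value n/2."""
    half_width = num_std * math.sqrt(n) / 2
    return sum(math.comb(n, k) / 2 ** n
               for k in range(n + 1) if abs(k - n / 2) <= half_width)

if __name__ == "__main__":
    for n in (16, 100, 400):
        mean, std = agreement_stats(n)
        # std / mean = 1 / sqrt(n) is the relative spread about orthogonality
        print(n, mean, std, std / mean, fraction_within(n, 3))
```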

To ensure dissimilar vectors one must also consider the number of prototypes to be chosen. The
minimal distance occurring between two balanced-Bernoulli vectors from among $M$ vectors chosen from
$n$-dimensional space is roughly $n/2 - \sqrt{2 \ln M} \cdot \sqrt{n}/2$ (see appendix B). In order that the two most
similar prototypes be dissimilar, we require that the above minimal distance be nearly $n/2$. This will
occur when $\sqrt{2 \ln M} \cdot \sqrt{n}/2$ is small in comparison. As we shall see, the number $M$ of prototype-pairs
to be stored in the matrix should not exceed the number of weights in the matrix. If the matrix is square,
this means $M$ will not exceed $n^2$ where $n$ is both the input and output dimensionality. For this
maximal value of $M$ we need $\sqrt{2 \ln M} \cdot \sqrt{n}/2$ to be several times smaller than $n/2$. This sets a
minimal bound on $n$. With $M = n^2$ the term equals $\sqrt{n \ln n}$, so a $c$-fold difference requires
$n \ge 4c^2 \ln n$. If we require at least an eight-fold difference between $n/2$ and $\sqrt{2 \ln M} \cdot \sqrt{n}/2$,
then $n$ must be just over 1900 or larger. A four-fold difference produces a lower bound just under 400.
In any event, the prototype dimensionality, both input and output, should be several hundred if an
associator is to discriminate well between a large number of stored prototypes.

4.2.2.  Matrix  Entropy

As shown in the previous chapter, the amount of information retrievable from the matrix $W$ is
bounded above by its entropy $H(W)$. In this section, the matrix-entropy is estimated and used to
ascertain the storage capacity of the matrix.

Given the $M$ input-output prototype-pairs $(f_m, g_m)$, the matrix defined by equation (1.1) is seen
as the sum of $M$ outer-product matrices. The $m$th outer-product, also called an association-plane or
plane, is completely determined by the $n_i + n_o$ bits of $f_m$ and $g_m$. Its $ji$th component $c_{ji}$ is the
product $f_{mi} g_{mj}$, which takes values in $\{-1, 1\}$. The $m$th association-plane is not changed if both
$f_m$ and $g_m$ are multiplied by $-1$. This indicates that the $m$th plane represents at most $n_i + n_o - 1$
bits of information. In fact, the entries of any given row and column are enough to determine every other
entry in the plane. To illustrate, examine the $k$th row and $l$th column and the entry $c_{ji} = f_{mi} g_{mj}$.
The three entries (bits) $c_{ki}$, $c_{kl}$ and $c_{jl}$ determine $c_{ji}$ so that the parity of these four numbers is
even. The $n_i + n_o - 1$ entries that make up a particular row and column are easily seen to be
independent, so that $n_i + n_o - 1$ is also the lower bound on the information in a plane. We conclude
that each association-plane represents exactly $n_i + n_o - 1$ bits. We mention also that the entropy of the
association plane is the same even when the output (input) prototypes are fixed outcomes, leaving only the
input (output) prototypes as balanced-Bernoulli vectors. From this we have that the matrix-sum $W$ of
the association planes has the same entropy from the point of view of an external process that has
knowledge of either (but not both) the set of input-prototypes or the set of output-prototypes.
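The claim that one row and one column fix the whole plane can be illustrated directly. This is a small sketch, assuming the plane is stored with rows indexed by the output component $j$ and columns by the input component $i$; the function names are hypothetical. It rebuilds every entry from $c_{ji} = c_{ki} \, c_{kl} \, c_{jl}$, which is the even-parity relation for ±1 entries.

```python
import random

def association_plane(f, g):
    """Outer product of +/-1 vectors: plane[j][i] = g[j] * f[i]."""
    return [[gj * fi for fi in f] for gj in g]

def rebuild_from_row_and_column(row_k, col_l, l):
    """Rebuild the whole plane from row k and column l alone, using
    c[j][i] = c[k][i] * c[k][l] * c[j][l] (valid because entries are +/-1)."""
    pivot = row_k[l]  # the entry shared by row k and column l
    return [[row_k[i] * pivot * col_l[j] for i in range(len(row_k))]
            for j in range(len(col_l))]

rng = random.Random(0)
n_i, n_o = 7, 5
f = [rng.choice((-1, 1)) for _ in range(n_i)]
g = [rng.choice((-1, 1)) for _ in range(n_o)]
plane = association_plane(f, g)

k, l = 2, 3
row_k = plane[k]
col_l = [plane[j][l] for j in range(n_o)]
rebuilt = rebuild_from_row_and_column(row_k, col_l, l)

# Negating both vectors leaves the plane unchanged (the "minus one bit").
negated = association_plane([-x for x in f], [-y for y in g])
```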

When the association-planes are summed, information is lost. To assess the matrix entropy, note
that each of the entries $W_{ji}$ of the matrix is the sum of $M$ 'bits' $f_{mi} g_{mj}$, $m = 1, 2, \ldots, M$.
Therefore $W_{ji} \sim \mathrm{Bin}(\pm 1, M, 1/2)$. As shown in appendix A, the entropy of $W_{ji}$ is

$$ H(W_{ji}) \approx \frac{1}{2} \log_2 \frac{\pi e M}{2} \tag{4.1} $$

As mentioned in the previous chapter, the entropy of a set of random variables is bounded above by the
sum of the individual entropies (see equation (2.2)). Since there are $N$ weights, where $N = n_i n_o$, and
since the weights have identical entropies, the upper bound of $H(W)$ is obtained by multiplying the
common weight-entropy $(1/2) \log_2 (\pi e M/2)$ by $N$. The entropy $H(W)$ will attain this upper bound if
and only if the weights are independent. The assumption that the weights are independent is false for
individual association planes. However, the planes are independent and the bit-patterns in one plane will
not generally be present in the others. For the sum of $M$ such planes where $M$ is large, the
weight-independence assumption should provide a close approximation to the true matrix entropy when
$M$ is much larger than both $n_i$ and $n_o$. We conclude then that

$$ H(W) \approx \frac{N}{2} \log_2 \frac{\pi e M}{2} \tag{4.2} $$

is a good approximation when $M \gg n_i$ and $M \gg n_o$.
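The weight-entropy approximation (4.1) can be compared against the exact entropy of a sum of $M$ fair ±1 bits. The sketch below is an illustration only: it assumes the bits are independent and fair, and the function names are not from the thesis.

```python
import math

def weight_entropy_exact(M: int) -> float:
    """Entropy in bits of a sum of M independent fair +/-1 bits.
    The sum equals 2K - M with K ~ Binomial(M, 1/2), so it has the entropy of K."""
    h = 0.0
    for k in range(M + 1):
        # log2 of C(M, k) / 2**M, via lgamma to avoid huge integers
        log2_p = (math.lgamma(M + 1) - math.lgamma(k + 1)
                  - math.lgamma(M - k + 1)) / math.log(2) - M
        h -= 2.0 ** log2_p * log2_p
    return h

def weight_entropy_approx(M: int) -> float:
    """Equation (4.1): (1/2) log2(pi e M / 2)."""
    return 0.5 * math.log2(math.pi * math.e * M / 2)

if __name__ == "__main__":
    for M in (4, 16, 100, 1000):
        print(M, weight_entropy_exact(M), weight_entropy_approx(M))
```

Multiplying either value by $N$ gives the independence upper bound on $H(W)$ used in (4.2).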

4.2.3.  Bound  on  the  Number  of  Items  Storable 

Consider the situation in which the $k$th input-prototype $F_k$ is present at the input of the
linear-associator and some process provides information about the $k$th output-prototype $G_k$ on the basis
of what it sees at the memory output. If the average information it provides about $G_k$ is $I$ bits then
from relation (3.12) of the previous chapter, we must have

$$ MI \le H(W) $$

Replacing $H(W)$ with its upper bound gives

$$ MI \le \frac{N}{2} \log_2 \frac{\pi e M}{2} $$

so that

$$ \frac{M}{N} \le \frac{\log_2 M + \log_2 (\pi e/2)}{2I} $$

We make the approximation $\log_2 (\pi e/2) \approx 2$ to get

$$ \frac{M}{N} \le \frac{\log_2 M + 2}{2I} \tag{4.3} $$

In the case that the process at the output of the matrix is a best-match algorithm, the matrix is acting as
a channel. By equation (3.28), page 32, we have

$$ I = \log_2 M - P_e \log_2 (M - 1) - \mathcal{H}(P_e) $$

where $P_e$ is the probability that the best-match process chooses a prototype other than $G_k$ as the one
most closely resembling the matrix-output vector. For our purposes, $M - 1 \approx M$ and so

$$ I \approx (1 - P_e) \log_2 M - \mathcal{H}(P_e) \tag{4.4} $$

Equation (4.3) becomes

$$ \frac{M}{N} \le \frac{1}{2} \cdot \frac{\log_2 M + 2}{(1 - P_e) \log_2 M - \mathcal{H}(P_e)} \tag{4.5} $$

Our criterion for minimal channel performance is that $P_e = 0$, in which case $I = \log_2 M$. This gives the
upper bound on $M/N$

$$ \frac{M}{N} \le \frac{1}{2} + \frac{1}{\log_2 M} \tag{4.6} $$

for perfect channel performance. When $M$ is large, say $\log_2 M > 16$, the upper bound for $M/N$ is
only negligibly larger than 1/2. Therefore we define the storage load or load, $L$, of the system to be
the ratio $2M/N$. A load of 1 corresponds to storing half as many prototype-pairs in the memory as
there are weights in the matrix. For large systems (50,000 weights or more), a load larger than one
precludes operation of the memory as a perfect channel.


4.2.4.  Trading  Storage  with  Error 

To understand how the load trades with error rate $P_e$, we rewrite equation (4.5) as the quotient

$$ \frac{M}{N} \le \frac{1}{2} \cdot \frac{\log_2 M + 2}{(1 - P_e) \log_2 M} \cdot \frac{1}{1 - \mathcal{H}(P_e)/[(1 - P_e) \log_2 M]} $$

Letting $z = \mathcal{H}(P_e)/[(1 - P_e) \log_2 M]$ and assuming this fraction is less than 1/3, we use the
approximation $1/(1 - z) \approx 1 + z$ to get

$$ \frac{M}{N} \lesssim \frac{1}{2} \cdot \frac{\log_2 M + 2}{(1 - P_e) \log_2 M} \left[ 1 + \frac{\mathcal{H}(P_e)}{(1 - P_e) \log_2 M} \right]
 = \frac{1}{1 - P_e} \left[ \frac{1}{2} + \frac{1}{\log_2 M} \right] \left[ 1 + \frac{\mathcal{H}(P_e)}{(1 - P_e) \log_2 M} \right] $$

If we assume that $P_e < 1/2$ and that $2/(\log_2 M)^2$ is less than, say, 1/18, then when we multiply out the
right-hand side, we can ignore the $\mathcal{H}(P_e)/[(1 - P_e)(\log_2 M)^2]$ term to get

$$ \frac{M}{N} \lesssim \frac{1}{1 - P_e} \left[ \frac{1}{2} + \frac{1}{\log_2 M} + \frac{\mathcal{H}(P_e)}{2 (1 - P_e) \log_2 M} \right] \tag{4.7} $$

This approximation is good for $M > 2^6$ when $P_e < 1/2$. These restrictions ensure that the term $z$
defined above is less than 1/3, which in turn ensures that the term we ignored to get relation (4.7) is small.
If we allow $P_e$ to be as large as 3/4, then we obtain a minimum value, $2^{10}$, of $M$ required for the
validity of (4.7).

A simpler bound for $M/N$ is afforded for $M > 2^{16}$. In this case, if $P_e$ is less than 1/2, the term
$(1 - P_e) \log_2 M$ is much larger than $\mathcal{H}(P_e)$ so that the latter can be ignored in relation (4.5). The
relation then becomes

$$ \frac{M}{N} \lesssim \frac{1}{1 - P_e} \left[ \frac{1}{2} + \frac{1}{\log_2 M} \right] \tag{4.8} $$

Notice that this is the bound in equation (4.6) multiplied by the inverse of the 'success rate' $(1 - P_e)$.
The approximation is valid for more modest values of $M$ when $P_e$ is smaller than 1/2. Summarizing the
analysis for larger systems, the number $N$ of weights needed to store $M$ associations for fixed $P_e$ is
$O(M)$. Allowing the load factor $L = 2M/N$ to be larger than 1, say $L = 1/(1 - r)$, $0 < r < 1$,
implies the error rate $P_e$ will be at least as large as $r$.
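The three forms of the bound on $M/N$ can be compared numerically. The following is a minimal sketch, assuming $\mathcal{H}$ denotes the binary entropy function in bits; the function names are illustrative and not taken from the thesis. It does not guard against parameter choices for which the denominator of (4.5) is non-positive.

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def ratio_bound_45(M: int, Pe: float) -> float:
    """Relation (4.5): bound on M/N for best-match error rate Pe."""
    L2 = math.log2(M)
    return 0.5 * (L2 + 2) / ((1 - Pe) * L2 - binary_entropy(Pe))

def ratio_bound_47(M: int, Pe: float) -> float:
    """Relation (4.7): first-order expansion of (4.5)."""
    L2 = math.log2(M)
    return (0.5 + 1 / L2 + binary_entropy(Pe) / (2 * (1 - Pe) * L2)) / (1 - Pe)

def ratio_bound_48(M: int, Pe: float) -> float:
    """Relation (4.8): entropy term dropped; equals (4.6) when Pe = 0."""
    return (0.5 + 1 / math.log2(M)) / (1 - Pe)

def load(M: int, N: int) -> float:
    """Storage load L = 2M/N."""
    return 2 * M / N

if __name__ == "__main__":
    M = 2 ** 16
    for Pe in (0.0, 0.1, 0.25, 0.4):
        print(Pe, ratio_bound_45(M, Pe), ratio_bound_47(M, Pe), ratio_bound_48(M, Pe))
```

Because $1/(1 - z) \ge 1 + z$ and the dropped terms are non-negative, the values are ordered (4.8) ≤ (4.7) ≤ (4.5) wherever (4.5) is positive.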


4.2.5.  Storage  Limits  for  Item  Memory 

Now we turn our attention to item-memory. We assume that when the $k$th input prototype is
presented to the matrix, the matrix output is used exclusively to produce a bit vector that is as accurate a
rendition of the $k$th output prototype as possible. It is assumed that no information other than that
provided by the matrix-output is allowed for production of the bit-vector. To be consistent with the other
sections of this thesis, we denote the system's 'rendition' of $G_k$ as $G_k''$. The term $I$ in equation (4.3)
is now $I(G_k'' : G_k)$. For the case that $P(G_{kj}'' = G_{kj}) \approx 1$, $j = 1, 2, \ldots, n_o$, we have that $I$ must be
$n_o$ bits and so

$$ \frac{M}{N} \le \frac{\log_2 M + 2}{2 n_o} $$

Substituting for $N$ and rearranging, a criterion for $n_i$ is found

$$ n_i \ge \frac{2M}{\log_2 M + 2} \tag{4.9} $$

For large $M$ (say $M > 2^{16}$) we can ignore the 2 in the denominator to get

$$ n_i \ge \frac{2M}{\log_2 M} \tag{4.10} $$

Since the bit-error rate is near zero, $G_k''$ should be virtually identical to $G_k$. If a best-match is used to
select the output-prototype that is nearest to $G_k''$, then $G_k$ will be chosen with near certainty. In other
words, if we define $P_e$ as the probability that $G_k$ is not chosen then $P_e$ should be near zero.

For this condition to hold, the memory must provide enough information at its output to act as a
channel with no errors. Therefore relation (4.8) must be satisfied. Using this together with (4.9) and the
fact that $N = n_i n_o$ one gets a lower bound on $n_o$

$$ n_o \ge \log_2 M + 2 $$

which is a minimal requirement to be made considering the parameter constraints discussed earlier in the
chapter.

For illustration, we design a matrix to store $M = 50,000$ pairs. With this large number, relation
(4.8) implies that $N$ is at least 100,000. The minimal value for $n_i$ becomes about 5700 and the
minimum for $n_o$ is about 18. With these values, the number of weights becomes 108,200. We will
compare this with the matrix parameters derived in the next section in which the system is allowed to
make errors.
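The design figures for error-free item memory follow from (4.9) and the bound on $n_o$. A minimal sketch, with a hypothetical helper name, evaluates the two lower bounds and rounds them up to whole dimensions.

```python
import math

def perfect_item_memory_dims(M: int):
    """Smallest integer dimensions allowed by (4.9) and by n_o >= log2(M) + 2."""
    log2_M = math.log2(M)
    n_i_min = math.ceil(2 * M / (log2_M + 2))  # relation (4.9)
    n_o_min = math.ceil(log2_M + 2)            # lower bound on n_o
    return n_i_min, n_o_min

n_i_min, n_o_min = perfect_item_memory_dims(50_000)
print(n_i_min, n_o_min)
```

For $M = 50,000$ this gives values in line with the "about 5700" and "about 18" quoted in the text.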


4.2.6.  Item-Memory  with  Errors

Now consider the case that the components of $G_k''$ each agree with their counterparts in $G_k$ with
probability noticeably less than 1. Assume that the probability that a pair $G_{kj}''$ and $G_{kj}$ agree is
independent of $j = 1, 2, \ldots, n_o$ and call this probability $p_G$. The probability of disagreement between
a pair of components is $1 - p_G$, which is non-zero, and so $G_k''$ will contain a substantial number of bits
that are in error. In this case, a best-match algorithm that compares $G_k''$ with the output-prototypes
will have a probability $P_e > 0$ that the wrong match is made.

The information that $G_k''$ provides about $G_k$ is bounded above by the information $G_k'$ provides
about $G_k$ and bounded below by the sum $\sum_j I(G_{kj}'' : G_{kj})$ of the information that $G_k''$ provides on a
bit-by-bit basis. The argument that this is a lower bound is similar to the argument given in the previous
chapter to substantiate relations (3.19) and (3.20). The information that $G_{kj}''$ provides about $G_{kj}$ is
given by $1 - \mathcal{H}(p_G)$. Using the above lower bound for $I$, this implies that relation (4.3) holds with $I$
replaced by $n_o (1 - \mathcal{H}(p_G))$. Assume that $p_G < 0.88$ so we can approximate $1 - \mathcal{H}(p_G)$ by
$(2 \log_2 e)(p_G - 1/2)^2$ as per equation (2.29). Inequality (4.3) becomes

$$ \frac{M}{N} \le \frac{\log_2 M + 2}{2 n_o (2 \log_2 e)(p_G - 1/2)^2} \tag{4.11} $$

For $M > 2^{16}$, we can ignore the 2 in the numerator on the right to get

$$ \frac{M}{N} \le \frac{\ln M}{4 n_o (p_G - 1/2)^2} \tag{4.12} $$

We can get a lower bound on $n_i$ by replacing $N$ in (4.11) by $n_i n_o$ and rearranging

$$ n_i \ge \frac{4M (\log_2 e)(p_G - 1/2)^2}{\log_2 M + 2} \tag{4.13} $$

Again, assuming $M > 50,000$ we can use (4.12) to get

$$ n_i \ge \frac{4M (p_G - 1/2)^2}{\ln M} \tag{4.14} $$

which holds for larger systems. We assume that $p_G > 1/2$ since $G_k''$ is supposed to be a
better-than-chance rendition of $G_k$. With this assumption the above relation can be expressed as an
upper bound on $p_G$ achievable by a given $n_i$

$$ p_G \le \frac{1}{2} \left( 1 + \sqrt{n_i \ln M / M} \right) \tag{4.15} $$

Since $p_G$ is less than 1, there is a non-zero probability $P_e$ that $G_k''$ will be mistaken for some prototype
other than $G_k$. If we assume that a best-match among the output prototypes is sought using the vector
$G_k''$ then the information $I(G_k'' : G_k)$ must exceed that required to operate the best-match process.
The information required for a best-match process with error rate $P_e$ is given by (3.28) of the previous
chapter and we can assure that $I(G_k'' : G_k)$ is larger than this by requiring

$$ n_o (1 - \mathcal{H}(p_G)) \ge (1 - P_e) \log_2 M - \mathcal{H}(P_e) $$

Assuming that $P_e < 1/2$ so that $(1 - P_e) \log_2 M > (1/2) \log_2 M$, we take $M$ to be larger than 50,000 as
usual. This allows one to ignore the $\mathcal{H}(P_e)$ term so that we have

$$ n_o (1 - \mathcal{H}(p_G)) \ge (1 - P_e) \log_2 M $$

With the assumption that $1/2 < p_G < 0.88$ we use the approximation (2.20) to get

$$ 2 n_o (\log_2 e)(p_G - 1/2)^2 \ge (1 - P_e) \log_2 M $$

which yields the reciprocal relations between the error probabilities

$$ P_e \ge 1 - \frac{2 n_o}{\ln M} (p_G - 1/2)^2 \tag{4.16} $$

$$ p_G \ge \frac{1}{2} + \sqrt{(1 - P_e) \ln M / (2 n_o)} \tag{4.17} $$

To obtain a bound on the matrix size, $n_o$ can also be expressed in terms of the other parameters:

$$ n_o \ge \frac{(1 - P_e) \ln M}{2 (p_G - 1/2)^2} \tag{4.18} $$

Note that relation (4.18) must hold for $p_G$ to satisfy both (4.17) and (4.15) simultaneously. From (4.18)
and (4.14) we have $N \ge 2(1 - P_e)M$, which is the same bound as given in (4.8) for $M$ large. While
$n_i$ and $n_o$ depend on $p_G$, their dependence is reciprocal so the matrix-size needed to store $M$ items is
not affected by $p_G$ given a fixed $P_e$.


We use these relations to design a matrix that can store $M = 50,000$ items with a channel error
$P_e = 1/2$ and an output-bit agreement probability $p_G = 3/4$. From relation (4.18) we obtain $n_o = 44$.
From (4.14) we also have $n_i \ge 1156$, so that $N \approx 50,000$. Again the matrix is one which 'fans-in' to
produce a highly reliable output under a large storage load. Notice that in accordance with
$(1 - P_e) = 1/2$, this system is roughly half the size of the one designed earlier for 'perfect' item retrieval.
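The design with errors can be reproduced from relations (4.14) and (4.18). The sketch below uses a hypothetical helper name and simply rounds each lower bound up to the next integer.

```python
import math

def item_memory_with_errors_dims(M: int, Pe: float, pG: float):
    """Smallest integer dimensions allowed by (4.14) and (4.18)."""
    ln_M = math.log(M)
    n_i_min = math.ceil(4 * M * (pG - 0.5) ** 2 / ln_M)           # relation (4.14)
    n_o_min = math.ceil((1 - Pe) * ln_M / (2 * (pG - 0.5) ** 2))  # relation (4.18)
    return n_i_min, n_o_min

n_i_min, n_o_min = item_memory_with_errors_dims(50_000, Pe=0.5, pG=0.75)
weights = n_i_min * n_o_min
print(n_i_min, n_o_min, weights)
```

The product of the two bounds is close to $2(1 - P_e)M$ regardless of $p_G$, which is the reciprocal dependence noted above.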


Under any of the above circumstances, the number of weights needed for storage is $O(M)$. Allowing
$P_e > 0$ allows an advantage with $M$ increasing roughly proportional to $1/(1 - P_e)$, $(P_e < 1/2)$. If a
bit-error $p_G < 1$ is allowed, then $P_e$ must be specified to determine $n_i$ and $n_o$ as a function of $M$.
Notice that relations (4.13), (4.14) and (4.18) imply that $n_i$ can be made smaller when $p_G$ is near 1/2,
whereas $n_o$ must be made larger to meet the same storage requirements since the number of weights must
satisfy relations (4.11) and (4.5). Requiring the bits of $G_k''$ to be accurate forces either $M$ or $n_o$ to
be small. That is, either the matrix must store few vectors (small ratio $M/N$) or the size $N = n_i n_o$ of
the matrix must be due largely to $n_i$. Heuristically, the matrix must be able to gather a large amount of
information at the input compared with the amount it supplies at the output. One would suspect that the
information supplied at the output is a function of the information available at the input. This
observation, which will be shown to be true in the next chapter, will be instrumental in deriving results
regarding classification.


4.3.  Storage  Efficiency 

Storage efficiency of a matrix will be defined as the matrix-storage divided by the information
required to represent a matrix associator on $M$ associations. We know that the number of bits stored by
the matrix is the matrix entropy $H(W)$. To get the number of bits required to store the matrix, we
examine equation (1.1) to ascertain the range of values that the weights can assume. This equation
reveals that each entry (weight) in an outer-product matrix is the sum of $M$ bits. The range of values of
each entry is the set of integers between $-M$ and $M$. The extremes are realized whenever the bits for
that entry all agree in value. Further, the entry will be even if and only if $M$ is even. It follows that
the number of values an entry can assume is $M + 1$. This means that $N$ weights will require
$N \log_2 (M + 1) \approx N \log_2 M$ bits for storage. We define the efficiency $\eta$ by the matrix-entropy divided by
the number of bits needed to represent the matrix

$$ \eta = \frac{H(W)}{N \log_2 M} \approx \frac{(1/2) N (\log_2 M + 2)}{N \log_2 M} = \frac{1}{2} + \frac{1}{\log_2 M} \tag{4.19} $$

which is the upper bound for the ratio of $M$ to $N$. In this case, the efficiency is asymptotically 1/2.


This is not the best we can do however. From the proof of the 'tails lemma' in appendix A, page
100, the entropy $H(W_{ji})$ of a weight of the $W$-matrix can be approximated by considering only
$2 r_M + 1$ of the most central values that the weight can achieve, where $r_M = \lfloor \sqrt{2M \log_2 M} \rfloor$. This
means that only these values occur often enough to represent a significant amount of the information
represented by the weight. So we can ignore the more extreme values the weight might take and thereby
only need roughly $\log_2 (2 \sqrt{2M \log_2 M}) = (1/2) \log_2 (2M \log_2 M) + 1$ bits to store each weight.

Let $M_0$ be a positive integer representing the maximum number of associations to be stored in the
matrix. If we restrict the weights to range in value from $-\lfloor \sqrt{2 M_0 \log_2 M_0} \rfloor$ to
$\lfloor \sqrt{2 M_0 \log_2 M_0} \rfloor$ then when the number $M$ of associations stored is no greater than $M_0$, the tails
lemma prescribes the maximum number of bits of information lost by making the range restriction. The
maximum information lost is given by the upper bound for $\epsilon$ in the tails-lemma, which is
$2 \log_2 e/(e M_0)$ (see appendix A, condition 2 and related footnote, page 100). Assuming that this is the
amount of information that is lost for each weight, the total lost for the entire matrix is no more than
$2N \log_2 e/(e M_0)$ bits. If the matrix is required to lose no more than $r$ bits of information due to the
weight restriction, then set $M_0$ equal to $N/r$ so that the maximum information loss is
$2N \log_2 e/(e N/r) = 2r \log_2 e/e \approx r$ bits. For the case that the load $L$ is expected to be less than 1
(that is, we don't intend to overload the matrix), we can set $M_0$ to be $N/2$ and will lose no more than
one bit for the whole matrix by restricting the weights to the prescribed range.
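The two ways of storing a weight discussed in this section can be compared numerically. This is a sketch under the section's own approximations, namely a weight entropy of $(1/2)(\log_2 M + 2)$ bits, a full-range cost of $\log_2 (M + 1)$ bits and a truncated-range cost of $(1/2) \log_2 (2M \log_2 M) + 1$ bits; the function names are illustrative.

```python
import math

def entropy_per_weight(M: int) -> float:
    """Approximate weight entropy, using log2(pi e / 2) ~ 2."""
    return 0.5 * (math.log2(M) + 2)

def bits_full_range(M: int) -> float:
    """Bits needed when a weight may take any of its M + 1 possible values."""
    return math.log2(M + 1)

def bits_truncated(M: int) -> float:
    """Bits needed when only the central range of values is kept."""
    return 0.5 * math.log2(2 * M * math.log2(M)) + 1

def max_info_loss(N: int, M0: int) -> float:
    """Upper bound, in bits, on the total information lost by truncation."""
    return 2 * N * math.log2(math.e) / (math.e * M0)

if __name__ == "__main__":
    for M in (2 ** 10, 2 ** 16, 2 ** 32):
        h = entropy_per_weight(M)
        print(M, h / bits_full_range(M), h / bits_truncated(M))
```

The truncated efficiency approaches 1 only slowly, since the ratio $\log_2 (\log_2 M)/\log_2 M$ decays slowly with $M$.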

The efficiency of this new system is again the matrix-entropy divided by $N$ times the logarithm of
the number of values permitted for each weight

$$ \eta \approx \frac{(1/2) N (\log_2 M + 2)}{N \{ (1/2) \log_2 (2M) + (1/2) \log_2 (\log_2 M) + 1 \}} \approx \frac{\log_2 M}{\log_2 M + \log_2 (\log_2 M)} \quad \text{for large } M \tag{4.20} $$

which is asymptotically near 1. Therefore, by simply truncating the range of the weights, we can, for a
fully loaded matrix, achieve a storage efficiency near unity while losing an insignificant amount of
information about the matrix.

Chapter 5
Classification


5.1.  Introduction 

Whereas the previous chapter considered the linear-associator as a memory, the present chapter will
treat it as a classifier. The classifier is merely a generalization of the memory in which the input-vectors
are no longer constrained to be input-prototypes. In this case, input-prototypes are each a representative
or 'prototype' of a distinct category of vectors in the input-space. A vector from the input-space
belongs to a category if it is closer, under the Hamming-distance metric, to the prototype of that category
than to other input-prototypes. The input-prototype and its category have a corresponding
output-prototype that represents the category in the output vector-space, and the associator has stored the
correspondence between the input and output prototypes. In this characterization, classification is similar
to channel-memory (see figure 5-1). The input-vector, by virtue of its membership in a particular
category, has a corresponding output-prototype which is the category's corresponding output-prototype.
Proper classification consists of associating the input-vector to an output-vector that is closer to the
input-vector's corresponding output-prototype than to the other output-prototypes.

The analysis begins with the characterization of the linear-associator as a classification device. A
non-linearity is applied to the associator-output to facilitate the analysis. Minimal requirements necessary
for proper performance of the classifier are explained and we describe the associator's information
characteristics relating to achieving these requirements. Methods of generating input-vectors are
formulated and are eventually shown to be equivalent from the point of view of the associator. The
information flow from input to output, called the 'throughput' of the associator, is then quantified and
related to performance capability of the associator. We will then be in a position to determine the
minimal size of sub-vectors within input-vectors that act as 'cues' for the input-vector category. We will
also quantify the percentage of the input-space that is classifiable by the system. We then 'revisit'
storage capacity and quantify its degradation due to the use of the non-linearity at the associator output.
Near the end of the chapter, the theory is illustrated with a few classifier designs and a discussion of
important aspects of their operation. Finally, we derive some merit parameters for judging
storage and classification performance of the associator as it compares with the best theoretically possible.

Figure 5-1: Classification by Prototype-Correspondence


5.2.  The  Associator  as  a  Classifier 

5.2.1.  Characterization  of  Classification

Consider an arbitrary classification device as shown in figure 5-2. The device can receive any
$n_i$-dimensional ±1-vector as an input, which will be referred to as the input-vector. The device has
stored information about $M$ vectors called input-prototypes. These prototypes are the $n_i$-dimensional
balanced-Bernoulli vectors $F_1, F_2, \ldots, F_M$. Each one is considered to be an exemplar of a distinct
category of $n_i$-dimensional ±1-vectors. An input-vector that is closer in Hamming-distance to the
prototype $F_k$ than to any of the other input-prototypes will be denoted by $F_k'$ and is said to belong to
the $k$th category. Thus, there are $M$ categories, each 'centered' about its exemplar. After receiving the
input $F_k'$, the classifier is expected to emit the number $k$ at its output to signal that the input belongs
to category $k$. A classification-error (or briefly an 'error') is said to have occurred when the response
of the classifier is some number other than $k$. The probability of classification error is denoted $P_e$.
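Category membership as defined above amounts to a nearest-prototype rule under Hamming distance. A minimal sketch follows, using zero-based category indices and hypothetical function names. Ties are resolved here in favour of the lowest index, which is an assumption of the sketch: the text defines membership only for vectors strictly closer to one prototype.

```python
def hamming_distance(a, b):
    """Number of components in which two +/-1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def category_of(x, prototypes):
    """Index (zero-based) of the input-prototype nearest to x."""
    distances = [hamming_distance(x, p) for p in prototypes]
    return distances.index(min(distances))

F = [
    [1, 1, 1, 1, 1, 1],
    [-1, -1, -1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1],
]
x = [1, 1, 1, 1, 1, -1]  # F[0] with its last component flipped
print(category_of(x, F))
```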




Figure 5-2: General Classifier for $n_i$-dimensional ±1-vectors


If the classification device is to operate with negligibly small $P_e$, the input-vector $F_k'$ must
provide at least $\log_2 M$ bits of information about its category-exemplar $F_k$. This is due to the fact that
$F_k'$ must be distinguished as belonging to one of $M$ categories and the only way the distinction can be
made is to determine which of $M$ exemplars is closest (see the chapter on the information-theory of
memory). We therefore have the constraint

$$ I(F_k' : F_k) \ge \log_2 M \tag{5.1} $$

Now consider the classification system of figure 5-3. In this case, the classifier is divided into two
stages. The first-stage is a linear-associator whose output is fed to a Hopfield non-linearity (defined
later). This stage, called the associator, translates $n_I$-dimensional ±1-vectors into $n_O$-dimensional
±1-vectors, where $n_O$ is the dimensionality of the associator's output-prototypes $G_1, G_2, \ldots, G_M$.
The second-stage is a best-match process that compares the output of the first-stage with the output
prototypes. In this case, the $M$ category-exemplars for the classifier are the input-prototypes
$F_1, F_2, \ldots, F_M$. As is the case for the general classifier of figure 5-2, an input-vector that belongs to
the $k$th category will be denoted $F_k'$. The resulting output of the linear-associator matrix will be called
$G_k'$ and the output of the Hopfield non-linearity is called $G_k''$.

Upon receipt of $F_k'$ at the input, the resulting vector $G_k''$ at the output is expected to be closer
to $G_k$ than to any other output-prototype. In this case, the best-match process of the second-stage
will respond with the number $k$ at the output. We regard the best-match device as an error-free
device. Errors will only occur if the first-stage produces a vector $G_k''$ that is closer to some
output-prototype other than $G_k$. In other words, the analysis is concerned with the performance limitations of
the first stage. The second-stage is merely an artifice for the sake of the characterization of the
classification 'task' of the linear-associator. In fact, the 'classification' done by the associator is just its
passing information to the output that enables one to determine which input-category is present at the
matrix-input.

We observe that the second-stage of figure 5-3 is itself a classifier of an arbitrary sort. Its category
exemplars are the vectors $G_1, G_2, \ldots, G_M$, so its input $G_k''$ must provide $\log_2 M$ bits of information
about $G_k$ if the second-stage is to classify reliably. The requirement that

$$I(G_k''; G_k) \geq \log_2 M \tag{5.2}$$

is thereby obtained as a constraint on the output $G_k''$ of the first-stage.

In a later section it will be shown that the output-information $I(G_k''; G_k)$ of the first-stage can
be regarded as a linear function of the input-information $I(F_k'; F_k)$. The ratio
$I(G_k''; G_k)/I(F_k'; F_k)$ will be denoted by $T(W)$ and is called the throughput of the associator.
Knowledge of the throughput will allow us to translate the constraint of (5.2) into a constraint on the
input-vectors $F_k'$. This in turn will reveal the fraction of the input-space $\mathcal{F}$ that can be classified.
The general idea is to define the input-redundancy (or simply the redundancy) $R$ of the input $F_k'$ to
be the ratio

$$R = \frac{I(F_k'; F_k)}{\log_2 M} \tag{5.3}$$

The constraint (5.1) then stipulates that $R \geq 1$. The question is just how much redundancy must be
present at the input to the associator to ensure reliable classification. The answer lies in the definition of
throughput, from which we have $I(G_k''; G_k) = T(W)\,I(F_k'; F_k)$, and so relations (5.2) and (5.3) imply
that the inequality $T(W)\,R\log_2 M \geq \log_2 M$ holds. That is,

$$R \geq \frac{1}{T(W)} \tag{5.4}$$

In the case that the associator is not lightly loaded, $T(W)$ will be less than 1 so that, by (5.3), the
constraint (5.4) is more stringent than relation (5.1). Later it will be shown that at most $M^{1-R}$ of the
input-space $\mathcal{F}$ is classifiable. A heavily loaded associator will have a low throughput and so require a
high redundancy. As a result, it can classify only a small portion of the input-space.
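The step from (5.3) to (5.4) can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis; the function names and the sample values of $M$ and $T(W)$ are illustrative assumptions.

```python
import math

def redundancy(input_info_bits, num_prototypes):
    """Redundancy R of (5.3): input information divided by log2(M)."""
    return input_info_bits / math.log2(num_prototypes)

def min_redundancy(throughput):
    """Smallest redundancy permitted by (5.4), R >= 1/T(W)."""
    if throughput <= 0.0:
        raise ValueError("throughput is assumed to be positive")
    return 1.0 / throughput

# Hypothetical example: M = 64 prototypes and a throughput of 0.25.
M = 64
R_min = min_redundancy(0.25)            # four-fold redundancy
bits_needed = R_min * math.log2(M)      # input information required, in bits
```

With these assumed numbers the input must carry four times the $\log_2 M$ bits that (5.1) alone would demand.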


Since the classifier of figure 5-3 is merely an associator followed by a classifier, one may wonder why
we should bother with the first-stage associator at all. One reason is that the associator translates
input-vectors into output-vector 'codes' that are more useful to subsequent processing stages. Another reason,
as we shall see, is the data-compression afforded by the associator. What data-compression is and its
usefulness will be seen near the end of the chapter.


5.2.2. Generation of Input Vectors

An important aspect of associative memory is the ability to respond to input-patterns that deviate
from the stored input-prototypes. In particular, suppose each input-prototype $F_k$ is divided up into
subvectors called features (see figure 5-4). That is, some subset of the $n_I$ components of $F_k$ represents
a 'field' in which a particular 'piece' of information is coded. If $F_k'$ has only this single piece of
information in common with $F_k$ and nothing (other than coincidental similarities) in common with the
other input-prototypes, then we call $F_k'$ a single-feature vector. It is desirable that an input-vector
be classifiable even if it is a single-feature vector. Call the number of components of $F_k$ that
compose a particular feature the feature-size. We seek the minimal feature-size necessary for reliable
classification of a single-feature vector.

Several methods of incorporating a feature of $F_k$ in $F_k'$, or inserting information about $F_k$ into
$F_k'$, are considered here. The first is to copy $r$ components of $F_k$ into $F_k'$ and set the rest of the
components of $F_k'$ to zero. This case can be reduced to analyzing the storage characteristics of an
associator with $r$-dimensional input. This method therefore is not as interesting as other methods which
don't allow zeros as components of the input-vector. Zeroing the 'unused' components however does
have the advantage that no spurious information is incorporated into the input-vector. As far as the
matrix is concerned, $r$ bits of information are actually present at the input.

Prototype:             (feature-1, feature-2, ..., feature-k, ..., feature-r)
Single-feature input:  (... "random" ..., feature-k, ... "random" ...)

Figure 5-4: Features Within Vectors

Another method is again to copy $r$ of $F_k$'s components to $F_k'$ and choose the rest of $F_k'$'s
components as a random selection of ±1's. This case is more interesting because it corresponds to $F_k'$
containing information other than that of the $r$-dimensional feature of $F_k$. This additional information
however is not relevant to the prototypes of the associator. Rather, it is used by other
associator-classifiers in a multi-classifier system (see figure 5-5). Each associator would sample the input-vector and
only act on the features the input contains that are relevant to the prototypes of the associator. The
input might represent the functional description of an object, each feature of the input-vector representing
a different functional aspect of the object. Each associator would have information about a specific
'feature-type' and associate features of this type to relevant 'concepts' or 'goals' of the system.

This method of generating the input-vectors actually incorporates $r$ bits of information about $F_k$
into $F_k'$. However, the network is probably not capable of using all $r$ bits of information. In the first
place, the associator has no way of knowing which $r$ components of $F_k'$ are the copies. What's more, it never
varies the way in which it 'weighs' a given component of $F_k'$ when determining its output $G_k'$.
Whether or not it happens to weigh the $r$ components of the feature heavier than the other components
of the input is a matter of 'happenstance'. Another related problem is that generating the input-vector
with inconsistent information is not well-accounted for by information theory. An input-vector $F_k'$
should be classified with the category-exemplar $F_k$ even when it contains information in direct opposition
to this choice of category. More precisely, copy $r_1$ components of $F_k$ to $F_k'$ and copy the negative of
each of $r_2$ other components to $F_k'$. Choose the remaining components of $F_k'$ randomly. We assume
$r_1 - r_2 > 0$ so that the net feature-size is $r = r_1 - r_2 > 0$. Again, if $r$ is large enough, then the
consistent information should 'override' the inconsistent information so that $F_k'$ is properly classified
into the $k$th category.

Input:  (feature-1, feature-2, ..., feature-q)

Figure 5-5: A Multi-Associator System

From an information-theory point-of-view however, the mutual information $I(F_k'; F_k)$ is no longer
$r$ bits but $r_1 + r_2$ bits. An observer of $F_k'$, knowing which components were copied directly and which
were negated, could infer the $r_1 + r_2$ values of those components of $F_k$. Of course, the associator treats
all the components of the input-vector the same. If $r$ is large, the dot-product $F_k \cdot F_k'$ of equation
(5.12), page 57, will be large and $F_k'$ will be correctly classified. From the point-of-view of the
associator-matrix, the useful information is $r$ bits, not $r_1 + r_2$ bits. A more substantial argument for this
view will be given later. The argument depends on the fact that the distribution of the matrix-output is
a function of $r$ only and does not otherwise depend on which of the above methods is used to generate
the input-vector.
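The copy-and-negate construction can be simulated directly to see that the mean similarity to the prototype is the net feature-size $r = r_1 - r_2$. This is only an illustrative sketch: the helper names, the seed, and the sizes chosen are assumptions made for the example.

```python
import random

def random_pm1(n, rng):
    """Balanced-Bernoulli vector: each component is +1 or -1 with equal odds."""
    return [rng.choice((-1, 1)) for _ in range(n)]

def make_input(prototype, r1, r2, rng):
    """Copy r1 components, negate r2 others, and randomize the rest."""
    n = len(prototype)
    positions = rng.sample(range(n), r1 + r2)
    vec = random_pm1(n, rng)
    for i in positions[:r1]:
        vec[i] = prototype[i]
    for i in positions[r1:]:
        vec[i] = -prototype[i]
    return vec

def dot(a, b):
    """Dot-product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)          # fixed seed, for repeatability only
n_i, r1, r2, trials = 200, 40, 10, 2000
proto = random_pm1(n_i, rng)
mean_similarity = sum(dot(proto, make_input(proto, r1, r2, rng))
                      for _ in range(trials)) / trials
# mean_similarity should land close to r1 - r2 = 30
```

Setting `r2 = 0` gives the second method described above (copy $r$ components, randomize the rest).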

Another method of generating the input $F_k'$ is to choose it within a region surrounding the
prototype $F_k$. We define the ball of radius $\rho$ about $F_k$ to be the set

$$B_k(\rho) = \{\, x \in \mathcal{F} \mid HD(F_k, x) \leq \rho \,\} \tag{5.5}$$

where $HD(x, y)$ is the Hamming-distance between the vectors $x$ and $y$. If $\rho > 0$ has a value such
that $B_k(\rho)$ comprises roughly $1/M$ of the input-space, then conceivably each of the $M$ balls
$B_m(\rho)$, $m = 1, 2, \ldots, M$, could occupy its
own region of the input-space $\mathcal{F}$ with little overlap. That is, most vectors of $\mathcal{F}$ would lie in exactly one
ball. The likelihood of small overlap of all the balls is small, but the important notion is that the largest
portion of space each can occupy is $1/M$ without unavoidable overlap.

Now consider generating $F_k'$ by choosing it at random from $B_k(\rho)$. We will call this method of
input-generation the neighborhood method. An observer of $F_k'$, knowing how it was generated, knows
that the input-prototype lies within $\rho$ of $F_k'$. Only $1/M$ of the input-space is this near $F_k'$, so
this knowledge constitutes an $M$-fold decrease in the number of possible values of $F_k$. Therefore the
vector $F_k'$ chosen at random from $B_k(\rho)$ provides $\log_2 M$ bits of information about $F_k$. Observe that
if $\rho$ were decreased so that $B_k(\rho)$ encompassed only $M^{-R}$ of the space, where $R > 1$, then the input
information $I(F_k'; F_k)$ would increase to $R\log_2 M$. This observation will be useful later when
comparing the methods of generating the associator-input.
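For small dimensions the fraction of the space covered by a Hamming ball can be counted exactly, which makes the link between ball size and information concrete. The sketch below is illustrative only; the function names are assumptions, and the ball is taken to include its boundary.

```python
import math

def ball_fraction(n, radius):
    """Fraction of the 2**n vectors within Hamming distance `radius` of a point."""
    count = sum(math.comb(n, d) for d in range(0, radius + 1))
    return count / 2 ** n

def neighborhood_information(n, radius):
    """Bits of information: -log2 of the fraction of the space the ball covers."""
    return -math.log2(ball_fraction(n, radius))

def largest_radius(n, num_prototypes, redundancy=1.0):
    """Largest radius whose ball covers at most M**(-R) of the space."""
    target = num_prototypes ** (-redundancy)
    radius = -1
    while radius + 1 <= n and ball_fraction(n, radius + 1) <= target:
        radius += 1
    return radius   # -1 means even the single centre point covers too much
```

For instance, `largest_radius(100, 16)` returns the largest radius at which a vector drawn from the ball still provides at least $\log_2 16 = 4$ bits about the centre.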

A final method of input-vector generation is that of flipping a biased coin to determine for each
component (bit) of the input-vector $F_k'$ whether it agrees with the corresponding component (bit) of $F_k$.
This will be referred to as the coin method. If the coin lands 'heads', we copy a component of $F_k$ to
$F_k'$. If it lands 'tails', we copy its negative to $F_k'$. Letting $p_F$ be the probability of 'heads', the
probability that a component of $F_k'$ agrees with its counterpart in $F_k$ is $p_F$. In order that $F_k'$ be a
better-than-chance rendition of $F_k$, we assume that $p_F > 1/2$. In this case, the information that $F_k'$
provides about $F_k$ is the sum over all components of the information that each component of $F_k'$
provides about its counterpart in $F_k$. We can write

$$I(F_k'; F_k) = \sum_{i=1}^{n_I} I(F_{ki}'; F_{ki}) \tag{5.6}$$


The information $I(F_{ki}'; F_{ki})$ is the function $1 - H(p_F)$, which is 1 bit minus the uncertainty $H(p_F)$ that
$F_{ki}'$ agrees with $F_{ki}$. When $p_F$ is not too near 1 (say $p_F < 0.88$) we can approximate $1 - H(p_F)$ by
$2(\log_2 e)(p_F - 1/2)^2$ (see approximation (2.29) page 10). The result is

$$I(F_k'; F_k) = n_I\,(1 - H(p_F)) \approx 2 n_I (\log_2 e)(p_F - 1/2)^2, \qquad 1/2 < p_F < 0.88 \tag{5.7}$$

We can assess the similarity of the input-vector $F_k'$ to the prototype $F_k$ as measured by the
dot-product. The average number of components of $F_k'$ that agree with their counterparts in $F_k$ is
$n_I p_F$. The average number that disagree is $n_I(1 - p_F)$. The components that agree contribute a 1 to
the value of the dot-product $F_k \cdot F_k'$ and the components that disagree add a -1. Therefore the mean of
the similarity is

$$E(F_k \cdot F_k') = n_I p_F(1) + n_I(1 - p_F)(-1) = (2p_F - 1)\,n_I \tag{5.8}$$

For the method of copying $r$ components to generate $F_k'$, the mean similarity is $r$. We therefore set
$r = (2p_F - 1)n_I$ to obtain the same mean similarity as for the coin method. This gives the reciprocal
relations

$$r = (2p_F - 1)\,n_I \tag{5.9}$$

and

$$p_F = \frac{1}{2} + \frac{r}{2 n_I} \tag{5.10}$$

It will be argued later that the various methods we described for generating the input-vector are
equivalent, from the point-of-view of the associator, to the coin method with $p_F$ given in (5.10).
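The coin method and the relations (5.7), (5.9) and (5.10) are easy to express in code, which also lets one see how close the quadratic approximation is to the exact information. This is a sketch under stated assumptions: the function names are illustrative, and $H$ is taken to be the binary entropy in bits.

```python
import math
import random

def binary_entropy(p):
    """Uncertainty H(p), in bits, of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def exact_information(n_i, p_f):
    """n_I * (1 - H(p_F)), the exact form in (5.7)."""
    return n_i * (1.0 - binary_entropy(p_f))

def approx_information(n_i, p_f):
    """Quadratic approximation in (5.7), intended for 1/2 < p_F < 0.88."""
    return 2.0 * n_i * math.log2(math.e) * (p_f - 0.5) ** 2

def p_from_r(r, n_i):
    """Relation (5.10)."""
    return 0.5 + r / (2.0 * n_i)

def r_from_p(p_f, n_i):
    """Relation (5.9)."""
    return (2.0 * p_f - 1.0) * n_i

def coin_input(prototype, p_f, rng):
    """Coin method: keep each component with probability p_F, else negate it."""
    return [x if rng.random() < p_f else -x for x in prototype]
```

Comparing `exact_information` with `approx_information` over a range of `p_f` shows the approximation tracking the exact value closely near 1/2 and drifting below it as `p_f` approaches the upper end of the stated range.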


5.2.3. Throughput of the Associator

To ascertain the throughput of the first stage of the classifier in figure 5-3 we must consider the
probability distribution of the components of $G_k''$. For $j = 1, 2, \ldots, n_O$, we show that the probability
that $G_{kj}'' = G_{kj}$ is independent of $j$. Calling this probability $p_G$, it is shown to be a function of the
probability $p_F$ defined earlier. Consequently, the output information $I(G_k''; G_k)$, itself a function of
$p_G$, is a function of the input-information $I(F_k'; F_k)$.

To assess $p_G$, note that $G_k''$ is produced from $G_k'$ via the 'Hopfield' [24, 25] non-linearity

$$G_{kj}'' = \begin{cases} 1 & \text{if } G_{kj}' \geq 0 \\ -1 & \text{otherwise} \end{cases} \tag{5.11}$$


The probability that $G_{kj}'' = G_{kj}$ is the probability that $G_{kj}' G_{kj} > 0$ since the two relations are
equivalent. As a result, we can compute $p_G$ once the probability distribution of $G_{kj}' G_{kj}$ is known.
Using the fact that $G_k' = W F_k'$ where $W$ is given by (2.19) we have

$$G_{kj}' G_{kj} = \sum_{m=1}^{M} (F_m \cdot F_k')\, G_{mj} G_{kj}
  = (F_k \cdot F_k')\, G_{kj}^2 + \sum_{m \neq k} (F_m \cdot F_k')\, G_{mj} G_{kj} \tag{5.12}$$

Using methods outlined in the chapter on notation, page 10, the probability function of the term
$(F_k \cdot F_k')\,G_{kj}^2$ in (5.12), call it the 'first term', can be determined. The same can be done for the
summation (call it the 'second term') in (5.12). Both the first term and the second term are sums of i.i.d.
r.v.'s so that the central limit theorem implies the two are both normally distributed. The sum of two
independent normal r.v.'s is normal so we conclude that $G_{kj}' G_{kj}$ is normal. The mean of $G_{kj}' G_{kj}$ is the
sum of the means of the first and second terms of (5.12) and similarly for the variance. Recalling that
$F_k'$ is generated by the coin method with $p_F = 1/2 + r/(2n_I)$, the mean of the first term is
$n_I(2p_F - 1)$ and the variance is $4 n_I p_F(1 - p_F)$. The mean of the second term is zero and the variance is
$(M - 1)n_I$. Therefore the mean of $G_{kj}' G_{kj}$ is $n_I(2p_F - 1) = r$ and the variance is
$4 n_I p_F(1 - p_F) + (M - 1)n_I$. The latter is very nearly equal to $M n_I$ for any value of $p_F$ provided
$M \geq 10$.
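The stated mean and variance of the product in (5.12) can be checked by simulation. The sketch below is an illustration under assumed, deliberately small sizes; the names, seed, and parameter values are not from the text, and the input is generated by the coin method.

```python
import random

def pm1(n, rng):
    """Vector of n independent +1/-1 components with equal odds."""
    return [rng.choice((-1, 1)) for _ in range(n)]

def sample_product(n_i, m, p_f, rng):
    """One draw of G'_kj * G_kj from (5.12), taking k as the first prototype."""
    f = [pm1(n_i, rng) for _ in range(m)]        # input-prototypes F_1..F_M
    g_j = pm1(m, rng)                            # j-th components of G_1..G_M
    f_in = [x if rng.random() < p_f else -x for x in f[0]]   # coin method
    total = 0
    for f_m, g_mj in zip(f, g_j):
        total += sum(a * b for a, b in zip(f_m, f_in)) * g_mj * g_j[0]
    return total

rng = random.Random(1)          # fixed seed, for repeatability only
n_i, m, p_f, trials = 64, 12, 0.75, 1500
draws = [sample_product(n_i, m, p_f, rng) for _ in range(trials)]
mean = sum(draws) / trials
var = sum((d - mean) ** 2 for d in draws) / (trials - 1)
predicted_mean = n_i * (2 * p_f - 1)                        # r
predicted_var = 4 * n_i * p_f * (1 - p_f) + (m - 1) * n_i   # first + second term
```

The sample `mean` and `var` should fall near `predicted_mean` and `predicted_var`, with the second term of (5.12) supplying most of the variance.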


Before calculating $p_G$ in terms of $p_F$, we make some observations with regard to the effect of
the method of generating $F_k'$ on the distribution of $G_{kj}' G_{kj}$. When $M \geq 10$, the variance of $G_{kj}' G_{kj}$ is
determined almost entirely by the second term of equation (5.12). The balanced-Bernoulli vectors
$F_m$, $m \neq k$, appearing in the second term are independent of $F_k'$ regardless of how $F_k'$ depends on
$F_k$ (see chapter 2, page 18, concerning dot-product independence). Thus the mean and variance of
$F_m \cdot F_k'$ will not be affected by any of the methods of generating a ±1-vector $F_k'$ from $F_k$. From
this we see that the variance of the second term will always be $(M - 1)n_I$ irrespective of the method of
generating $F_k'$. Since $F_k'$ is a ±1-vector, the variance of the first term of (5.12) can never exceed $n_I$.
The first term will therefore not contribute substantially to the variance of $G_{kj}' G_{kj}$ under any method of
input-generation. Also $G_{kj}' G_{kj}$ is normally distributed since the second term is a large sum of i.i.d.
r.v.'s. The nature of the first term is inconsequential due to its small variance. Further, the mean of
$G_{kj}' G_{kj}$ is $r$ for any of the methods given for generation of $F_k'$. We see then that the product
$G_{kj}' G_{kj}$ has virtually the same distribution for any method of input-generation. In particular, we have
that $G_{kj}' G_{kj} \sim N(r, M n_I)$. We conclude that the various methods of generating the input-vector are
virtually equivalent from the viewpoint of the associator. From this point on, these methods will be
discussed interchangeably (see footnote 7).

From this, we have also that the input-information provided by the coin method represents the
maximum amount of information utilized by the associator for any mode of input-generation. This can be
seen by replacing $p_F - 1/2$ by the equivalent $r/(2n_I)$ in equation (5.7) to get

$$I(F_k'; F_k) \approx \frac{(\log_2 e)\, r^2}{2 n_I} \tag{5.13}$$

This information is less than $r$ bits when $r < n_I/\log_2 e$. This will be the case in the analysis to follow
since (5.13) is necessary for (5.7) to hold. We conclude that the coin method provides the smallest
input-information compared with the other methods (the neighborhood method provides roughly the same
amount of input-information as the coin method). Because the associator sees no difference in these
methods, the input-information provided by the coin method must be the maximum amount useful to the
associator when computing the output vector. The coin method of generation can therefore be used to
ascertain the performance of the associator regardless of the actual method of input-generation. This allows
us to exploit the simplicity of analysis afforded by the coin method while retaining the generality to
performance under the other input-generation modes.

We now begin to calculate the probability $p_G$ that $G_{kj}'' = G_{kj}$, which is the same as the
probability that $G_{kj}' G_{kj} > 0$. Since the product $G_{kj}' G_{kj}$ is normal with mean $(2p_F - 1)n_I$ and
variance $M n_I$, the probability $p_G$ is easily determined

$$
\begin{aligned}
p_G &= \Pr(G_{kj}' G_{kj} > 0) \\
    &= 1 - \Pr(G_{kj}' G_{kj} \leq 0) \\
    &= 1 - \Pr\left[\,G_{kj}' G_{kj} \text{ is at least } (2p_F - 1)n_I/\sqrt{M n_I} \text{ standard deviations below the mean}\,\right] \\
    &= 1 - \Phi\!\left(\frac{-(2p_F - 1)\,n_I}{\sqrt{M n_I}}\right) \\
    &= \Phi\!\left((2p_F - 1)\sqrt{n_I/M}\right) \qquad \text{since } \Phi(x) = 1 - \Phi(-x)
\end{aligned}
\tag{5.14}
$$

Footnote 7: The equivalence of the neighborhood method to the coin method follows from the fact that the vast majority of vectors
in the interior of the ball in (5.5) lie near the boundary provided the radius is less than n/2 (see Kanerva [28]). The ball
method and coin method will be consistent if the radius of the ball is roughly $n_I(1 - p_F)$ (see appendix B). The
distribution of vectors generated via either method is that of a 'ring' surrounding the central category-prototype, the
'thickness' of the ring being determined by the variance of the coin method.


where $\Phi$ is the standard normal distribution function. Since $p_F < 1$ and $M$ will generally be larger
than $n_I$, it follows that $(2p_F - 1)\sqrt{n_I/M}$ is typically less than 1. This allows use of the Taylor
approximation to $\Phi$ given in chapter 2 page 10. We get

$$p_G \approx \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\sqrt{n_I/M}\,(2p_F - 1)
      = \frac{1}{2} + \sqrt{\frac{2 n_I}{\pi M}}\,(p_F - 1/2) \tag{5.15}$$

In a manner similar to the derivation of equation (5.7) we have

$$I(G_k''; G_k) \geq n_O\,(1 - H(p_G)) \approx 2 n_O (\log_2 e)(p_G - 1/2)^2, \qquad 0.5 < p_G < 0.88 \tag{5.16}$$
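The quality of the linearization (5.15) relative to the exact expression (5.14) can be inspected numerically. The following sketch uses the standard identity $\Phi(x) = \tfrac12(1 + \operatorname{erf}(x/\sqrt{2}))$; the function names and sample values are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_g_exact(p_f, n_i, m):
    """Relation (5.14)."""
    return phi((2.0 * p_f - 1.0) * math.sqrt(n_i / m))

def p_g_linear(p_f, n_i, m):
    """First-order approximation (5.15)."""
    return 0.5 + math.sqrt(2.0 * n_i / (math.pi * m)) * (p_f - 0.5)

def output_information(n_o, p_g):
    """Quadratic estimate (5.16) of the output information, in bits."""
    return 2.0 * n_o * math.log2(math.e) * (p_g - 0.5) ** 2
```

For a case such as `p_f = 0.6`, `n_i = 100`, `m = 400` the argument of $\Phi$ is 0.1 and the two expressions for $p_G$ agree to within about $10^{-4}$.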


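Assuming $p_G$ is in the stated range, we appeal to (5.15) and substitute $\sqrt{2n_I/(\pi M)}\,(p_F - 1/2)$ for
$p_G - 1/2$ in (5.16)

$$I(G_k''; G_k) \geq 2 n_O (\log_2 e)\,\frac{2 n_I}{\pi M}\,(p_F - 1/2)^2 \approx \frac{2 n_O}{\pi M}\, I(F_k'; F_k) \tag{5.17}$$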


where the second approximation is due to (5.7). Dividing by $I(F_k'; F_k)$ (assumed larger than zero), we
have a lower bound on the throughput of the associator

$$T(W) \geq \frac{2 n_O}{\pi M} \tag{5.18}$$


5.3. Classifiable Inputs


5.3.1. Lower Bounds on Input Information

As stated earlier, the redundancy $R$ must be at least $1/T(W)$ for reliable classification. Now
that the throughput of the associator has been found, we have the lower bound

$$R \geq \frac{\pi M}{2 n_O} \tag{5.19}$$

By definition (5.3), the input-information is given by

$$I(F_k'; F_k) = R\log_2 M \tag{5.20}$$

Together, (5.19) and (5.20) imply a lower bound on the input-information

$$I(F_k'; F_k) \geq \frac{\pi M \log_2 M}{2 n_O} \tag{5.21}$$

By our assumption, $F_k'$ is generated by the coin method. Thus the bitwise information
$I(F_{ki}'; F_{ki})$, $i = 1, 2, \ldots, n_I$, is independent of $i$. Also the input-information
$I(F_k'; F_k)$ is given by (5.6). We conclude that the input-information is $n_I$ times the bitwise
information. Dividing relation (5.21) by $n_I$, we get the lower bound

$$I(F_{ki}'; F_{ki}) \geq \frac{\pi M \log_2 M}{2N} \tag{5.22}$$

for the bit-wise information, where $N = n_I n_O$ is the number of weights.
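The bounds (5.18) through (5.22) are simple enough to tabulate for a candidate design. The sketch below is illustrative; the function names are assumptions, and $N$ is taken to be $n_I n_O$, as implied by dividing (5.21) by $n_I$.

```python
import math

def throughput_lower_bound(n_o, m):
    """Relation (5.18): T(W) >= 2 n_O / (pi M)."""
    return 2.0 * n_o / (math.pi * m)

def redundancy_lower_bound(n_o, m):
    """Relation (5.19): R >= pi M / (2 n_O)."""
    return math.pi * m / (2.0 * n_o)

def input_information_lower_bound(n_o, m):
    """Relation (5.21), in bits."""
    return redundancy_lower_bound(n_o, m) * math.log2(m)

def bitwise_information_lower_bound(n_i, n_o, m):
    """Relation (5.22), with N = n_I * n_O weights."""
    return math.pi * m * math.log2(m) / (2.0 * n_i * n_o)
```

Doubling the number of prototypes at fixed $n_O$ halves the throughput bound and more than doubles the required input information, since the $\log_2 M$ factor grows as well.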


5.3.2. Lower Bounds on Feature Size

We can obtain minimal requirements on $p_F$ and $r$ by inverting the approximations of (5.7) and
(5.13) to get each parameter in terms of $I(F_k'; F_k)$. From (5.7) and the assumption that $p_F > 1/2$ we
have

$$p_F \approx \frac{1}{2} + \sqrt{\frac{I(F_k'; F_k)}{2 n_I \log_2 e}}
      = \frac{1}{2}\left(1 + \sqrt{\frac{2\, I(F_k'; F_k)}{n_I \log_2 e}}\right) \tag{5.23}$$

The relation for $r$ is obtained from (5.13), (5.23) and the fact that $r = (2p_F - 1)n_I$

$$r \approx \sqrt{2 n_I\, I(F_k'; F_k)/\log_2 e} \tag{5.24}$$

where $I(F_k'; F_k)/\log_2 e$ is the input-information in natural-logarithm units or 'nats'. Using equation
(5.20) we get $p_F$ in terms of the redundancy

$$p_F \approx \frac{1}{2} + \sqrt{\frac{R \ln M}{2 n_I}} \tag{5.25}$$

Similarly for $r$,

$$r \approx \sqrt{2 n_I R \ln M} \tag{5.26}$$

The lower bound (5.19) for $R$ gives a lower bound for each parameter

$$p_F \geq \frac{1}{2}\left(1 + \sqrt{\pi M \ln M / N}\right) \tag{5.27}$$

and

$$r \geq \sqrt{(n_I/n_O)\,\pi M \ln M} \tag{5.28}$$


This means that if $F_k'$ is generated from $F_k$ by copying $r$ of $F_k$'s components, we need to copy at
least $\lceil\sqrt{(n_I/n_O)\,\pi M \ln M}\,\rceil$ components for classification to be possible. Reliable classification requires
that this number be the minimum feature-size allowable for the input-vector if it is a single-feature vector.
The number of non-overlapping features (sub-vectors) an input-vector can have is obviously the
dimensionality of the vector divided by the minimal feature-size,
$\lfloor n_I / \lceil\sqrt{(n_I/n_O)\,\pi M \ln M}\,\rceil \rfloor$. If we let
$f_{\min}$ be the minimal feature size and $n_{f\max}$ be the maximal number of non-overlapping features
allowable in an input-vector, then we have roughly

$$f_{\min} \approx \sqrt{(n_I/n_O)\,\pi M \ln M} \tag{5.29}$$

and

$$n_{f\max} \approx \sqrt{N/(\pi M \ln M)} \tag{5.30}$$

As shown later, the fraction under the radical in (5.30) cannot be less than one for reliable classification.
We see then that if we are to have $n$ non-overlapping features in our vectors, then the number of weights
in the associator will have to exceed $\pi M \ln M$ by a factor of $n^2$. This is a rather heavy price to pay for
the ability to classify vectors on the basis of a single feature.
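As a worked example of the feature-size bounds, the minimal feature size and the number of non-overlapping features can be computed for a given associator. This is a sketch with assumed function names and illustrative dimensions; it applies the ceiling and floor mentioned in the text, so its feature count can fall slightly below the rough value of (5.30).

```python
import math

def min_feature_size(n_i, n_o, m):
    """Relation (5.29), rounded up to a whole number of components."""
    return math.ceil(math.sqrt((n_i / n_o) * math.pi * m * math.log(m)))

def max_feature_count(n_i, n_o, m):
    """Whole number of non-overlapping features of minimal size, cf. (5.30)."""
    return n_i // min_feature_size(n_i, n_o, m)

def rough_feature_count(n_i, n_o, m):
    """The unrounded estimate (5.30), sqrt(N / (pi M ln M)), with N = n_I n_O."""
    return math.sqrt(n_i * n_o / (math.pi * m * math.log(m)))

# Hypothetical design: n_I = n_O = 1000 and M = 100 prototypes.
size = min_feature_size(1000, 1000, 100)      # 39 components
count = max_feature_count(1000, 1000, 100)    # 25 features
```

Here a million weights support only about two dozen single-feature cues for 100 prototypes, which illustrates the 'heavy price' noted above.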

We make one important observation regarding the information content of an $n$-dimensional
±1-vector. If $X$ is the number of 1's that occur in a balanced-Bernoulli vector $A$, then $X$ is a r.v.
with mean $n/2$ and standard deviation $\sqrt{n}/2$. It stands to reason therefore that a sub-vector of $A$ of
length $\sqrt{n}/2$ represents a unit of information of $A$. To verify this, let $R$ be the redundancy (as defined
by (5.3) for some $M > 0$) of the information that $A$ is to provide about another vector $B$. If we are
to copy components of $B$ to $A$, then equation (5.26) gives the minimal number $r$ of components that
should be copied (the rest are chosen independently of the components of $B$). This number can be
expressed in terms of the number of standard-deviation-length sub-vectors needed

$$r = 2\sqrt{2R\ln M}\,\left(\sqrt{n}/2\right) \tag{5.31}$$

To provide $R\log_2 M$ bits of information, we must copy at least $2\sqrt{2R\ln M}$ sub-vector 'units' of
information from $B$. The 'square-root' relationship between the number of bits of information and the
number of sub-vector 'units' is due to the quadratic dependence of information on the probability that a
component of one vector agrees with its counterpart in another vector (see relation (5.7)). The fact that
information in balanced-Bernoulli vectors is closely related to $\sqrt{n}/2$-length sub-vectors must play a part
in any mode of representation that codes information into ±1-vectors. If information coded into
sub-regions of the input-vector is to provide the sole cue to an associator for classification, the subregions
must cover at least $2\sqrt{2R\ln M}$ sub-vector 'units' of the input-vector, where $R$ is the minimal
input-redundancy required by the associator.
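The factorization in (5.31) can be confirmed numerically: the number of sub-vector 'units' does not depend on the dimension $n$. The sketch below uses assumed function names and arbitrary sample values.

```python
import math

def copies_needed(n, redundancy, m):
    """Relation (5.26): components to copy so as to provide R*log2(M) bits."""
    return math.sqrt(2.0 * n * redundancy * math.log(m))

def subvector_units(redundancy, m):
    """Number of sqrt(n)/2-length units appearing in (5.31)."""
    return 2.0 * math.sqrt(2.0 * redundancy * math.log(m))

# The unit count is the same for any n; only the unit length sqrt(n)/2 changes.
units = subvector_units(1.5, 64)
```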

5.3.3. Fraction of the Input Space that is Classifiable

An analysis of minimal requirements for the neighborhood method of input-generation is given in
appendix B. Because this method is roughly equivalent to the coin method and because it gives us an
estimate of the number of vectors that can be classified, we relate the results here. First, for a ball
centered about an input-prototype, if a randomly chosen vector from the ball is to provide $R\log_2 M$ bits
of information about the prototype, then the ball must comprise $M^{-R}$ of the input-space. From
appendix B, the radius $\rho$ is roughly

$$\rho \approx \frac{n_I}{2} - \frac{\sqrt{n_I}}{2}\sqrt{2R\ln M - \ln(4\pi R\ln M)} \tag{5.32}$$

The lower bound on the redundancy in (5.19) gives an upper bound on the radius

$$\rho \leq \frac{n_I}{2} - \frac{\sqrt{n_I}}{2}\sqrt{\pi M\ln M/n_O - \ln(2\pi^2 M\ln M/n_O)} \tag{5.33}$$

In appendix B, geometrical considerations of the output space suggest that this radius is too large. The
excess redundancy required however should not be more than twice the minimum (see appendix B for a
discussion of this point). This gives us a lower bound for $\rho$

$$\rho \geq \frac{n_I}{2} - \frac{\sqrt{n_I}}{2}\sqrt{2\pi M\ln M/n_O - \ln(4\pi^2 M\ln M/n_O)} \tag{5.34}$$
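The two radius bounds follow from one formula evaluated at the minimal redundancy and at twice that value. The sketch below implements (5.32) as reconstructed from the surrounding definitions, so it should be read as an assumption-laden illustration; the function names and sample dimensions are hypothetical.

```python
import math

def radius_for_redundancy(n_i, redundancy, m):
    """Approximation (5.32) to the ball radius for redundancy R."""
    a = redundancy * math.log(m)
    inside = 2.0 * a - math.log(4.0 * math.pi * a)
    if inside < 0.0:
        raise ValueError("approximation (5.32) is not usable for these values")
    return n_i / 2.0 - (math.sqrt(n_i) / 2.0) * math.sqrt(inside)

def radius_bounds(n_i, n_o, m):
    """Return (lower, upper): R doubled, cf. (5.34), and R minimal, cf. (5.33)."""
    r_min = math.pi * m / (2.0 * n_o)        # lower bound (5.19) on R
    return (radius_for_redundancy(n_i, 2.0 * r_min, m),
            radius_for_redundancy(n_i, r_min, m))

# Hypothetical design: n_I = 1000, n_O = 200, M = 500.
lower, upper = radius_bounds(1000, 200, 500)
```

Both values lie below $n_I/2$, and the gap between them reflects the factor-of-two uncertainty in the required redundancy.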


We now derive the upper bound on the fraction of the input-space that can be classified. This result
is obtained from the lower bound on the information required at the associator input. Since the associator
produces an output on the basis of the Hamming-distance between the input-vector and the
input-prototypes, input-vectors providing the associator a specified amount of information about an
input-prototype should come from a set of vectors nearest to the prototype. If the set is a ball of radius $\rho$
about the prototype, then random selection of a vector from the ball (neighborhood method of
input-generation) is roughly equivalent to the coin method of input-generation when $\rho \approx n_I(1 - p_F)$. When an
input-vector $F_k'$ is generated by the neighborhood method, and the information it provides about $F_k$ is
$I(F_k'; F_k)$, the ball it comes from will encompass $2^{-I(F_k'; F_k)}$ of the total input-space. For our
system, there are $M$ balls surrounding $M$ input-prototypes so the total fraction of the input space
covered by the $M$ balls is at most $M\,2^{-I(F_k'; F_k)}$. The regions could overlap, though the overlap
will be negligible if the input-information is at least $2\log_2 M$. Now if $R$ is the redundancy of the input,
then the input information is $R\log_2 M$ bits and the fraction $C$ of the input-space that is classifiable is

$$C = M^{1-R} \tag{5.35}$$

Using the lower bound on $R$ we have the upper bound on $C$

$$C \leq M^{\,1 - \pi M/(2 n_O)} \tag{5.36}$$

where $M$ is assumed to be larger than $n_O$. In fact, as we shall see later, $M$ will
usually be greater than $n_O$ by a large factor so that the fraction of the space that is classifiable will be
quite small.
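Because the bound on $C$ becomes extremely small very quickly, it is convenient to evaluate it on a logarithmic scale. The following is a minimal sketch with an assumed function name and illustrative values.

```python
import math

def log10_classifiable_fraction_bound(n_o, m):
    """log10 of the bound (5.36), C <= M**(1 - pi*M/(2*n_O))."""
    exponent = 1.0 - math.pi * m / (2.0 * n_o)
    return exponent * math.log10(m)

# Hypothetical case: M = 1000 prototypes with n_O = 500 output components.
log_c = log10_classifiable_fraction_bound(500, 1000)
```

For this assumed case `log_c` is roughly -6.4, so fewer than one input-vector in a million would be classifiable.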

5.3.4. Restrictions on Matrix Dimensions

The inequality of (5.33) is required for reliable classification, whereas inequality (5.34) is merely
a reasonable bound on how small the value of $\rho$ need be made to insure the system will work.
Therefore the bound in (5.33) must be larger than zero if the system is to classify its inputs. This
constraint leads to a lower bound on $N$ which will be derived by different means later (see equation
(5.42)). The lower bound on $N$ is the minimal number of weights required merely for storing the
prototypes when the Hopfield non-linearity is present at the associator output.

An even tighter constraint on the required matrix size is obtained when we require that the system
be capable of classifying 'highly-degraded' input-vectors. A highly-degraded input-vector is a vector
that is nearly orthogonal to its category-exemplar (the nearest input-prototype). From (5.33), we see
that classification of such inputs is possible when $n_i$ is large compared to
$\sqrt{\pi M\ln M}\,\sqrt{n_i/n_o}$. In this case, if $\rho$ is near the theoretical maximum given in
(5.33), the input-vectors at the edge of the neighborhood of a prototype will be at a
Hamming-distance nearly $n_i/2$ from the prototype. A reasonable way to make $n_i$ large enough is to
require $n_i \ge 8\sqrt{\pi M\ln M}\,\sqrt{n_i/n_o}$. Multiplying through by $\sqrt{n_o/n_i}$ and
squaring both sides of this inequality gives us a lower bound on the number $N = n_i n_o$ of weights

$$N \ge 64\pi M\ln M \qquad (5.37)$$

Comparing this to the requirement (5.42) for storage, we see that classification of 'highly-degraded'
input-vectors requires roughly 50-100 times the number of weights required for merely storing the
prototype vectors.
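As a rough numerical illustration of this comparison, the following Python sketch evaluates the two lower bounds side by side. It assumes the storage requirement (5.42) is read as $N \ge \pi M\ln M$ and the degraded-input requirement (5.37) as $N \ge 64\pi M\ln M$; the function names are illustrative and do not appear in the original text.

```python
import math

def storage_weights(m: int) -> float:
    """Lower bound on the weights N needed merely to store m prototypes (5.42)."""
    return math.pi * m * math.log(m)

def degraded_weights(m: int) -> float:
    """Lower bound on N for classifying highly-degraded inputs (5.37)."""
    return 64 * math.pi * m * math.log(m)

if __name__ == "__main__":
    for m in (1_000, 50_000):
        ratio = degraded_weights(m) / storage_weights(m)
        print(m, round(storage_weights(m)), round(degraded_weights(m)), ratio)
```

Under these readings the ratio of the two bounds is the constant 64, which sits inside the "roughly 50-100 times" range quoted above.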

We note a few restrictions on the parameters inferred by the analysis in appendix B. First, if the
input-vector is to have a redundancy no greater than $R$ (keeping $R$ low makes a larger portion of
the input-space classifiable, see equation (5.35)), then we must have $\rho > 0$ in equation (B.6),
page 109. This becomes the constraint

$$n_i > 2R\ln M \qquad (5.38)$$

This constraint applies equally well for the output dimensionality with $R$ between 1 and 2 so that

$$n_o > 2\ln M \qquad (5.39)$$

is a minimal requirement for the output-dimensionality (see equation (B.8)). In the 'throughput'
section, restrictions on the parameters $n_i$, $n_o$ and $M$ were also made to obtain the
approximations used to obtain the associator throughput. The linear approximation made in equation
(5.14) assumed that $M$ was at least as large as $n_i$. This assumption assures that the argument to
$\Phi$ was no larger than 1 so that higher terms in the Taylor approximation to $\Phi$ can be dropped.

The assumption that the argument to $\Phi$ in equation (5.14) was less than 1 leads to a restriction
on $p_G$. This assumption together with (5.15) gives the upper bound

$$p_G \le \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \qquad (5.40)$$


These relations illustrate the limitations of the theory that has been developed. A designer of an
associator on $M$ associations must stay within the parameter-assumptions in order for the
performance predictions of the theory to apply.


5.4. Performance Degradation Due to Non-Linear Output

The 'Hopfield non-linearity' in figure 5-3 was introduced for the sake of simplifying the analysis.
The problem of determining the information $I(G_k'; G_k)$ available directly from the
associator-matrix is somewhat more difficult than finding the information $I(G_k''; G_k)$ available
from the non-linearity. Unfortunately, however, addition of the non-linearity eliminates much of the
information available from $G_k'$. That this is so is evidenced by the degradation of storage
capacity due to the non-linearity.

To estimate the storage capacity of the non-linear associator in figure 5-3, put $p_F = 1$ to
constrain the input vectors to belong to the set of input-prototypes. The formula that gives $p_G$
in terms of $p_F$ becomes

$$p_G \approx \frac{1}{2}\left(1 + \sqrt{\frac{2n_i}{\pi M}}\right) \qquad (5.41)$$

This approximation is good when $p_G$ is near 1/2, so in particular, $M$ must be at least $n_i$ in
(5.41). The approximation was obtained from (5.15) which is a linearization of the normal
distribution function $\Phi(x)$ about $x = 0$. It overestimates $p_G$, with the overestimate becoming
large as $p_G$ nears 1. In fact one pays a high penalty in storage capacity when insisting that each
bit of $G_k''$ match its counterpart in $G_k$ with high probability. This is seen from the fact that
when $M$ is increased $p_G$ does not increase as rapidly as (5.15) would indicate. In any event,
using equation (5.15) will give an upper bound on the storage capacity.
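The size of the overestimate can be checked numerically. The sketch below compares the linearized estimate (5.41) with a normal-distribution value. The un-linearized form $\Phi(\sqrt{n_i/M})$ is an assumption made here for illustration (it is the expression whose first-order expansion about zero reproduces (5.41)); it is not quoted from the text, and the function names are hypothetical.

```python
import math

def phi(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_g_linear(n_i: int, m: int) -> float:
    """Linearized estimate (5.41), valid for p_F = 1 and p_G near 1/2."""
    return 0.5 * (1.0 + math.sqrt(2.0 * n_i / (math.pi * m)))

def p_g_normal(n_i: int, m: int) -> float:
    """ASSUMED un-linearized form Phi(sqrt(n_i / m)); not taken from the source."""
    return phi(math.sqrt(n_i / m))

if __name__ == "__main__":
    n_i = 1000
    for m in (1_000, 10_000, 100_000):
        print(m, p_g_linear(n_i, m), p_g_normal(n_i, m))
```

Because $\Phi$ is concave for positive arguments, the linear estimate never falls below the assumed normal value, and the gap shrinks as $M$ grows relative to $n_i$.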


As stated in the chapter on storage capacity, useful storage requires the output information to be
at least $\log_2 M$ bits. During retrieval, the number of bits present at the input is $n_i$. If we
multiply $n_i$ by the throughput $T(W)$ and require the result to be larger than $\log_2 M$, a
constraint on the matrix size is obtained. Unfortunately $T(W)$ was obtained by assuming $p_F$ was
not too near 1. We will have to use equations (5.41) and (5.18) instead to get the constraint.
Remember however, (5.41) assumes $p_G$ is not too near 1, which will be the case if $M > 2$ . From
(5.41) and (5.18) we have

$$I(G_k''; G_k) \approx \frac{N\log_2 e}{\pi M}$$

By the constraint (5.2), the right-hand-side must be larger than $\log_2 M$. The resulting
inequality can then be rearranged to get

$$\frac{\pi M\ln M}{N} \le 1 \qquad (5.42)$$

To put (5.42) another way, $N$ must be at least $O(M\ln M)$. This is a stronger requirement than the
one derived for storage in the previous chapter. This new bound implies that if $n_o$ is
$O(\ln M)$, then $n_i$ must be $O(M)$.


If errors are allowed at the output of the second stage of figure 5-3 then the storage can be
increased. If $P_e$ is the error probability, then for $0 < P_e < 1/2$, $M$ large, we need
$(1 - P_e)\log_2 M$ bits at the output. From this and (5.18) we have

$$2n_o(\log_2 e)(p_G - 1/2)^2 \ge (1 - P_e)\log_2 M \qquad (5.43)$$

and from (5.41)

$$\frac{n_i n_o\log_2 e}{\pi M} \ge (1 - P_e)\log_2 M$$

which gives

$$\frac{\pi M\ln M}{N} \le \frac{1}{1 - P_e} \qquad (5.44)$$

As with the case of storage treated in the previous chapter, the number of required weights is
proportional to $1 - P_e$. On the other hand, the maximal value of $M$ no longer increases in
proportion to $1/(1 - P_e)$.
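The last remark can be made concrete with a small search. The Python sketch below finds the largest $M$ satisfying $\pi M\ln M \le N/(1 - P_e)$ by bisection, taking that inequality as the reading of (5.42) and (5.44); the function name and the sample values of $N$ and $P_e$ are illustrative assumptions.

```python
import math

def max_categories(n_weights: float, p_e: float = 0.0) -> int:
    """Largest integer M >= 2 with pi * M * ln(M) <= N / (1 - p_e)."""
    budget = n_weights / (1.0 - p_e)

    def fits(m: int) -> bool:
        return math.pi * m * math.log(m) <= budget

    if not fits(2):
        raise ValueError("N is too small to store even two categories")
    lo, hi = 2, 3
    while fits(hi):            # grow the bracket until hi violates the bound
        lo, hi = hi, hi * 2
    while hi - lo > 1:         # invariant: lo fits, hi does not
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    n = 1.7e7
    print(max_categories(n), max_categories(n, p_e=0.25))
```

Because of the $\ln M$ factor, allowing an error probability $P_e$ raises the storable $M$ by less than the factor $1/(1 - P_e)$.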

The reason the non-linearity decreases the information content of the output of the associator is
that it forces the best-match process of figure 5-3 to 'count' the number of places that the output
$G_k''$ disagrees in sign with $G_k$ (recall the method of computing $G_k''$). This can be seen from
figure 5-3 with the non-linearity removed and from equation (5.12), which is the formula for one
summand-term in the dot-product $G_k' \cdot G_k$. If the best-match process in figure 5-3 uses the
output of the associator-matrix directly, it can use the dot-product similarity-measure to compare
$G_k'$ with every one of the output-prototypes. Now, a single summand in the dot-product
$G_k' \cdot G_k$ is binomially distributed with positive mean $(2p_F - 1)n_i$. Such a term will tend
to have larger magnitude when it is positive than when it is negative. This means that the
dot-product can do more than 'count' how many positions of $G_k'$ agree in sign with their
counterparts in $G_k$. The dot-product also uses 'magnitude' information to ascertain the
'confidence' that a specific component of $G_k'$ is of the proper sign. On the other hand, whether
the performance limits of the previous chapter can be achieved depends on whether retrieval in the
linear-associator is optimal. For this to be so, the full entropy of the matrix (per storage item)
must be available at the memory output. What's more, the information available must be useful to the
best-match process.
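The difference between the two retrieval rules can be sketched in a few lines of Python. This is a minimal illustration, assuming $\pm 1$ prototype vectors and a plain outer-product weight matrix; the function names and the `hard_limit` flag are hypothetical and stand in for the presence or absence of the Hopfield non-linearity.

```python
def outer_product_matrix(inputs, outputs):
    """W[j][i] = sum over k of outputs[k][j] * inputs[k][i] for +/-1 pairs."""
    n_o, n_i = len(outputs[0]), len(inputs[0])
    w = [[0] * n_i for _ in range(n_o)]
    for f, g in zip(inputs, outputs):
        for j in range(n_o):
            for i in range(n_i):
                w[j][i] += g[j] * f[i]
    return w

def recall(w, f):
    """Linear associator output G' = W f, before any non-linearity."""
    return [sum(w_ji * f_i for w_ji, f_i in zip(row, f)) for row in w]

def sign(x):
    return 1 if x >= 0 else -1

def best_match(vec, prototypes):
    """Index of the prototype with the largest dot-product with vec."""
    scores = [sum(v * p for v, p in zip(vec, proto)) for proto in prototypes]
    return max(range(len(prototypes)), key=scores.__getitem__)

def classify(w, f, out_protos, hard_limit):
    g = recall(w, f)
    if hard_limit:
        # Keeping only signs makes the dot-product a count of sign agreements.
        g = [sign(x) for x in g]
    return best_match(g, out_protos)
```

With `hard_limit=True` every component contributes equally to the match score; with `hard_limit=False` components of larger magnitude carry more weight, which is the 'confidence' information described above.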

The analysis of the linear case should entail evaluation of the information content of $G_k'$ by
evaluating it as a rendition of the 'signal' $G_k$ with added binomial 'noise'. The 'signal-to-noise
ratio' as a function of $M$ would then be used to quantify the information content. The analysis is
similar in concept to the evaluation of information contained by a gaussian signal in the presence
of gaussian noise (see Gallager, [12, p. 32, Example 4]). The difference is that the 'signal'
components of $G_k$ are not gaussian but Bernoulli r.v.'s and the 'noise' in $G_k'$ due to the
associator-matrix is binomial rather than gaussian. These differences are responsible for the
difficulty in determining the information $I(G_k'; G_k)$. The difficulties are not insurmountable,
but the analysis may be as involved as that in Appendix A, since the problem of approximating a
discrete entropy with a continuous one in the appendix seems related to the problem of approximating
the information in $G_k'$.

5.5. Classifier Design Considerations

At this point, we are ready to illustrate the design of an associator to meet specific requirements.
Two designs will be given to show how the relative sizes of parameters interact. Given the number
$M$ of categories, a fraction $\alpha$ of the space to be classified and the maximum classification
error-probability $P_e$, we wish to find the dimensions $n_i$ and $n_o$ that result in a matrix of
minimal size $N$ that meets the requirements (see footnote 8). To begin, let $P_e = 0$ for
simplicity. Notice that a ball $B_k(\rho)$ about a prototype must contain about $\alpha/M$ of the
input space. Since the fraction of input-vectors in the ball is $\exp_2(-I(F_k'; F_k))$, we have

$$\frac{\alpha}{M} = \exp_2(-I(F_k'; F_k)) \qquad (5.45)$$

so that

$$I(F_k'; F_k) = \log_2 M - \log_2\alpha \qquad (5.46)$$

Now $R = I(F_k'; F_k)/\log_2 M$ so by (5.46) we have $\log_2\alpha - \log_2 M = -R\log_2 M$.
Rearranging and converting to natural logarithms gives a more convenient form

$$R = 1 + \frac{-\ln\alpha}{\ln M} \qquad (5.47)$$

The two classifiers we produce will be called the large-$\alpha$ model and the small-$\alpha$ model
(see footnote 9). The large-$\alpha$ model will have $-\ln\alpha$ proportional to $\ln M$, so that
for some positive $K > 10$ we write

$$-\ln\alpha = K\ln M \qquad (5.48)$$

The small-$\alpha$ model assumes that $-\ln\alpha$ is proportional to $M$. In this case we put

$$-\ln\alpha = \frac{M}{K} \qquad (5.49)$$

with $K < M/(10\ln M)$. Calculating the redundancy from (5.47) for the large-$\alpha$ model we have

$$R = 1 + K \approx K \qquad (5.50)$$

and for the small-$\alpha$ model

$$R = 1 + \frac{M}{K\ln M} \approx \frac{M}{K\ln M} \qquad (5.51)$$

Footnote 8: Of course, a design problem may differ as to which parameters are initially specified.
Most notable is the case when a designer is dealing with an input-space whose vector-dimensionality
is already known.

Footnote 9: Since $0 < \alpha < 1$, the quantity $-\ln\alpha$ is positive and grows without bound as
$\alpha \to 0$. The terms 'large-$\alpha$' and 'small-$\alpha$' are of course relative. A
large-$\alpha$ model will only classify a small portion of the input-space. A small-$\alpha$ model
will classify a portion orders of magnitude smaller. Even in the case of the small-$\alpha$ model
however, there are $\exp_2(n_i)$ possible input-vectors so that the actual number of vectors
classifiable is still very large.


Recall that relation (3.19) must hold for reliable classification. From this we get the lower bound
on $n_o$

$$n_o \ge \frac{\pi M}{2R} \qquad (5.52)$$

For the large-$\alpha$ model, this implies

$$n_o \ge \frac{\pi M}{2K} \qquad (5.53)$$

For the small-$\alpha$ model

$$n_o \ge \frac{\pi}{2}K\ln M \qquad (5.54)$$


To get a constraint on $n_i$, we use the fact that the maximum Hamming-distance between an
input-vector and its category-exemplar is roughly

$$\rho_{max} = \frac{n_i}{2} - \frac{\sqrt{2Rn_i\ln M}}{2} \qquad (5.55)$$


If we are to classify vectors that are nearly orthogonal to their category vectors, then
$\rho_{max}$ should be nearly $n_i/2$. For the large-$\alpha$ model, this is more important than for
the small-$\alpha$ model since the former must classify more of its input-space. The closer
$\rho_{max}$ is to $n_i/2$ however, the more weights are required for either model given a fixed
value of $K$. For the sake of comparison then, we will use the same value
$\rho_{max} = (2/3)\,n_i/2$ for both models. This isn't much of a constraint. A better one is
$\rho_{max} = (9/10)\,n_i/2$ but the number of weights required would be about 10 times as large.
From equation (5.55) and our constraint, we get

$$n_i \approx 3\sqrt{2Rn_i\ln M}$$

so that

$$n_i = 18R\ln M \approx 20R\ln M \qquad (5.56)$$
For the large-$\alpha$ model $R \approx K$ so

$$n_i = 20K\ln M \qquad (5.57)$$

whereas the small-$\alpha$ model has $R \approx M/(K\ln M)$ so

$$n_i = \frac{20M}{K} \qquad (5.58)$$

The number $N$ of weights in both cases is $10\pi M\ln M$, or 10 times the minimum required for
storing $M$ prototypes.

The thing to notice is that the large-$\alpha$ model has $n_i$ of order $\ln M$ and $n_o$ of order
$M$. In other words, the output-dimensionality far exceeds the input-dimensionality as $M$ grows. In
order to classify such a large portion of the input-space, the input-redundancy must not be large.
This is seen from relation (5.47). When $\alpha \to 1$, we have $-\ln\alpha \to 0$ so that
$R \to 1$. The throughput of the system must be large so many units are needed to produce the
output.

For the small-$\alpha$ model the situation is reversed. The input-dimensionality is large and so can
accommodate the large input-redundancy (the redundancy can never exceed $n_i/\log_2 M$). The number
of output units can be small since the high redundancy insures adequate output information even with
low throughput.

As a numerical example, suppose that $M = 50{,}000$ and, to assure $M > n_i$ in (5.58), let
$K = 50$. For the large-$\alpha$ model, $R \approx 50$ so by (5.57) $n_i \approx 10{,}800$, and by
(5.53) $n_o \approx 1570$. For the small-$\alpha$ model $R \approx 92$, equation (5.58) implies
$n_i = 20{,}000$ and (5.54) gives $n_o \approx 850$. Both models have roughly $1.7 \times 10^7$
weights.
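The numerical example can be reproduced with a short Python sketch. It takes the design relations in the forms $n_o = \pi M/(2K)$ and $n_i = 20K\ln M$ for the large-$\alpha$ model, and $n_o = (\pi/2)K\ln M$ and $n_i = 20M/K$ for the small-$\alpha$ model, treating the lower bounds on $n_o$ as equalities. The function name `design` and its dictionary output are illustrative choices, not part of the original text.

```python
import math

def design(m: int, k: float, model: str) -> dict:
    """Redundancy and dimensions for the 'large' or 'small' alpha model."""
    ln_m = math.log(m)
    if model == "large":
        r = 1 + k                        # (5.50)
        n_i = 20 * k * ln_m              # (5.57), using R ~ K
        n_o = math.pi * m / (2 * k)      # (5.53) taken with equality
    elif model == "small":
        r = 1 + m / (k * ln_m)           # (5.51)
        n_i = 20 * m / k                 # (5.58), using R ~ M / (K ln M)
        n_o = 0.5 * math.pi * k * ln_m   # (5.54) taken with equality
    else:
        raise ValueError("model must be 'large' or 'small'")
    return {"R": r, "n_i": n_i, "n_o": n_o, "N": n_i * n_o}

if __name__ == "__main__":
    for name in ("large", "small"):
        print(name, design(50_000, 50, name))
```

In both cases the product $n_i n_o$ reduces to $10\pi M\ln M$, so the two designs use the same number of weights and differ only in how that budget is split between input and output dimensions.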

Now let $\zeta$ be the number of classifiable vectors in each case. We want to estimate the entropy
$\log_2\zeta$ of the classifiable portion of the input-space. By equation (5.35), this entropy is
roughly $\log_2(M^{1-R}\exp_2(n_i))$, or approximately $\log_2\zeta = n_i + (1 - R)\log_2 M$. By
equation (5.47) we have

$$\log_2\zeta = n_i + \log_2\alpha \qquad (5.59)$$

For the large-$\alpha$ model, $\log_2\zeta = n_i - K\log_2 M \approx 10{,}000$. For the
small-$\alpha$ model, $\log_2\zeta = n_i - (M\log_2 e)/K \approx 18{,}600$. The proportion of the
space classified by the large-$\alpha$ model is roughly $10^{-240}$ whereas the small-$\alpha$ model
classifies roughly $10^{-440}$ of its input-space (computed from the respective values of
$\alpha$).
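These entropies are easy to recompute. The Python sketch below evaluates $n_i + \log_2\alpha$ for both models with $M = 50{,}000$ and $K = 50$, working with $\ln\alpha$ directly because $\alpha$ itself is far too small for floating point. The helper name and variable names are illustrative assumptions.

```python
import math

def classifiable_bits(n_i: float, ln_alpha: float) -> float:
    """log2 of the number of classifiable vectors: n_i + log2(alpha), per (5.59)."""
    return n_i + ln_alpha / math.log(2)

m, k = 50_000, 50
ln_m = math.log(m)

# Large-alpha model: -ln(alpha) = K ln M and n_i = 20 K ln M.
large_bits = classifiable_bits(20 * k * ln_m, -k * ln_m)
large_log10_alpha = -k * ln_m / math.log(10)

# Small-alpha model: -ln(alpha) = M / K and n_i = 20 M / K.
small_bits = classifiable_bits(20 * m / k, -m / k)
small_log10_alpha = -(m / k) / math.log(10)

if __name__ == "__main__":
    print(large_bits, large_log10_alpha)
    print(small_bits, small_log10_alpha)
```

The small-$\alpha$ entropy exceeds even the full input dimension $n_i$ of the large-$\alpha$ model, even though the small-$\alpha$ fraction of classified space is smaller by about two hundred orders of magnitude.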
The moral however, is that the small-$\alpha$ model does not classify fewer vectors than does the
large-$\alpha$ model. The input-space for the small-$\alpha$ model is so much larger than that of
the large-$\alpha$ model that the actual number of vectors classifiable by the small-$\alpha$ model
is much larger. In fact, the number of vectors that can be classified by the small-$\alpha$ model
dwarfs the number of vectors in the entire input-space of the large-$\alpha$ model.

One way of viewing this numerical advantage of the small-$\alpha$ model is in terms of data
compression. Whereas the number of input-vectors to be classified is potentially very large, the
number $M$ of categories at the output is relatively minuscule (the number of categories should be
less than the number of weights or even smaller). The entropy of the output relative to that of the
input is therefore quite small and this is what is meant by 'data-compression'. The fact that the
matrix faces less information at its output than at its input should be reflected by its
architecture if high performance is expected. For a classifier with $N$ weights that is to classify
a large number of input-vectors, the output-dimensionality should be as small as possible (within the
constraints described in appendix B) compared with the input-dimensionality. Such a system will
classify a maximal number of input-vectors for a given number of associations (categories) stored.

One should also notice that the classifier classifies only a very small portion of the input-space.
This results in a 'double-data-compression'. Most inputs are simply not considered to be valid input
'signals'. Those that are will then be mapped to a relatively small number of categories. The final
result is an output that has far less entropy than the total input-space. We conclude that the
associator-as-classifier assumes that most of the space of possible inputs is irrelevant to its
task. The portion of the space that is considered relevant is specified by the collection of
prototype-vectors. These in turn specify the pertinent informational-features of the input-space.
All other information is ignored, resulting in an output that is a compact representation of the
salient features of the input.


5.6. Maximal Performance and Figures of Merit

5.6.1. Merit Parameters and Figures of Merit

We define a merit-parameter to be some measure of system performance with regard to storage or
classification. In the case that there is a maximal value for the merit-parameter, we divide the
merit-parameter by the maximal value to get a figure-of-merit. The maximal value for the parameter
is determined via information-theoretic constraints on an arbitrary memory/classification system and
so is independent of features specific to a particular device. The figure-of-merit will generally
take on a value between zero and one with the value '1' corresponding to optimal performance. Thus
the merit-figure can be used for comparison of various systems whose merit-figures are known.
5.6.2. Load, Efficiency, Throughput and Retrievable Information

In the chapter on storage, we derived a figure-of-merit $L$ called the load. It was defined as the
ratio of the number of items stored (a merit-parameter) to the number of items storable. Another
figure-of-merit we defined was called the efficiency, $\eta$, the ratio of the number of bits stored
in the memory to the number of bits required to represent the memory itself. For classification, it
is also desirable to obtain relevant merit-parameters and figures-of-merit.


An obvious merit-parameter for classification is the throughput $T(W)$ defined earlier. The optimal
value $T_0$ can be derived for an arbitrary memory obeying relation (3.13). The throughput-merit,
$\tau$, of a system is then defined as $T(W)/T_0$. To obtain $T_0$, we divide the maximum-possible
output-information by the minimum allowable input-information. For systems obeying equation (3.13),
the maximum output-information per association is $H(W, M)/M$. The minimal input-information
required is $\log_2 M$ bits so we have

$$T_0 = \frac{H(W, M)}{M\log_2 M} \qquad (5.60)$$


So the throughput-merit is given by

$$\tau = \frac{T(W)}{T_0} = \frac{T(W)\,M\log_2 M}{H(W, M)}$$

where

$$\frac{T(W)\,M\log_2 M}{H(W, M)} \le \frac{M\,I(G_k''; G_k)}{H(W, M)} \le 1 \qquad (5.61)$$


If we use the fact that $H(W, M) \approx (1/2)N\log_2 M$ then the figure-of-merit $\tau$ for
linear-associator systems satisfies

$$\tau \approx \frac{T(W)\,M\log_2 M}{\frac{1}{2}N\log_2 M} = T(W)\frac{2M}{N} = T(W)\,L \qquad (5.62)$$

where $L$ is the load. Thus the throughput-merit for the outer-product associator is just the
product of the two merit parameters derived earlier. This product however has the additional
property that it can never exceed 1. It would be of interest to know whether the throughput-merit
for the linear-associator (without the Hopfield non-linearity) is roughly equal to 1 (or at least
constant) for a large range of values of the load. If so, we'd have that the throughput trades
directly with load as more associations are stored.

In any event, we have that

$$T(W) \le \frac{N}{2M} \qquad (5.63)$$

for the linear-associator. For the associator with no non-linearity then, the upper bound can be
quite large when $M$ is much smaller than $N$.
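A few lines of Python make the trade between load and throughput explicit. The sketch assumes the load is $L = 2M/N$, as implied by the approximation $H(W, M) \approx (1/2)N\log_2 M$, and the function names are illustrative only.

```python
def load(m: float, n: float) -> float:
    """Load L = 2M / N under the approximation H(W, M) ~ (1/2) N log2 M."""
    return 2.0 * m / n

def throughput_merit(t_w: float, m: float, n: float) -> float:
    """Throughput-merit tau = T(W) * L, per (5.62)."""
    return t_w * load(m, n)

def max_throughput(m: float, n: float) -> float:
    """Upper bound N / (2M) on the throughput of the linear-associator, per (5.63)."""
    return n / (2.0 * m)

if __name__ == "__main__":
    m, n = 50_000, 1.7e7
    t_max = max_throughput(m, n)
    print(t_max, throughput_merit(t_max, m, n))
```

A system operating at the bound (5.63) has a throughput-merit of one, so any increase in the load must be paid for by a proportional decrease in the attainable throughput.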


For the case that the Hopfield non-linearity is present at the matrix-output, we can obtain the
maximum $\tau$ achievable by the associator (see figure 5-3). By (5.42), the number $N$ in (5.62) is
larger than $\pi M\ln M$. Replacing $N$ by this value in (5.62) gives the upper bound

$$\tau \le T(W)\frac{2M}{\pi M\ln M} = \frac{2T(W)}{\pi\ln M} \qquad (5.64)$$

Since $T(W) = 2n_o/(\pi M)$, we have the bound

$$\tau \le \frac{2n_o}{\pi M}\cdot\frac{2}{\pi\ln M} = \frac{4n_o}{\pi^2 M\ln M} \qquad (5.65)$$

which is much smaller than 1 if the number of stored prototypes is larger than $n_o$. By way of
comparison, the linear associator could conceivably have a $\tau$ as large as 1. However this has
not been established since the throughput of the linear-associator has not been determined.


A figure-of-merit relevant to the memory is apparent from the results of chapter 2. By relation
(3.13), we have $I(G_k''; G_k) \le H(W)$. Therefore the retrievable-fraction of stored information
is

$$\mu = \frac{M\,I(G_k''; G_k)}{H(W)} \qquad (5.66)$$

The retrievable fraction, by relation (3.14), cannot exceed 1.


For the non-linear associator, we can find the maximal retrievable-fraction from knowledge of the
throughput. Remembering that the largest that the input-information can be is $n_i$ bits, we use the
definition of throughput to get

$$\mu = \frac{M\,T(W)\,I(F_k'; F_k)}{\frac{1}{2}N\log_2 M} \le \frac{M\,(2n_o/(\pi M))\,n_i}{\frac{1}{2}N\log_2 M} = \frac{4}{\pi\log_2 M} \qquad (5.67)$$

This parameter is quite small for large systems that store many associations. For the
Hopfield-non-linear associator, systems become extremely sub-optimal as the system-size gets large.
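To see how small these figures become, the following Python sketch evaluates the two bounds for the Hopfield-non-linear associator, taking (5.65) as $\tau \le 4n_o/(\pi^2 M\ln M)$ and (5.67) as $\mu \le 4/(\pi\log_2 M)$. The function names and the sample values ($M = 50{,}000$, $n_o = 850$) are illustrative assumptions drawn from the design example.

```python
import math

def hopfield_throughput(n_o: float, m: float) -> float:
    """Throughput T(W) = 2 n_o / (pi M) of the associator with the non-linearity."""
    return 2.0 * n_o / (math.pi * m)

def tau_bound(n_o: float, m: float) -> float:
    """Upper bound (5.65) on the throughput-merit."""
    return 4.0 * n_o / (math.pi ** 2 * m * math.log(m))

def mu_bound(m: float) -> float:
    """Upper bound (5.67) on the retrievable fraction."""
    return 4.0 / (math.pi * math.log2(m))

if __name__ == "__main__":
    m, n_o = 50_000, 850
    print(hopfield_throughput(n_o, m), tau_bound(n_o, m), mu_bound(m))
```

The retrievable-fraction bound depends on $M$ alone and decays only logarithmically, whereas the throughput-merit bound falls off roughly as $n_o/M$.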

5.6.3. Search for an Overall Figure of Merit for Memory

It would be preferable if an overall figure-of-merit for memory-performance could be found. This
figure, called the memory-merit, $\mathcal{M}$, should reflect all aspects of memory operation and
have the property that a memory could in principle attain a memory-merit of one. An attempt to
define $\mathcal{M}$ might involve taking the product of $\tau$, $\mu$, and $\eta$ to get

$$\mathcal{M} = \tau\mu\eta \qquad (5.68)$$

For memory systems whose load $L$ can be defined, one can restrict consideration to memories that
are not overloaded (i.e. $L \le 1$). The load could then be incorporated into $\mathcal{M}$

$$\mathcal{M} = \tau\mu\eta L \qquad (5.69)$$

The efficiency $\eta$ is just related to the representation used for the weights of the memory and
is therefore indicative of limitations of the memory's implementation. This parameter should be
dropped if only the memory's inherent properties are to be considered

$$\mathcal{M} = \tau\mu L \qquad (5.70)$$

If there is a general figure-of-merit for memory, this last one may be close to the mark. On the
other hand, we saw in relations (5.61), (5.62) and (5.66) that $\tau$ is related to both $\mu$ and
$L$, so one may wonder if $\mathcal{M}$ in (5.70) may contain redundant information. Also, there may
be tradeoffs that force the value of one of the factors in (5.70) to be low when the other is high.
If this is true even in principle, then it is possible that no memory can achieve a merit of one and
the memory-merit would not satisfy the definition of a figure-of-merit. This possibility seems
unlikely based on calculations done by the author. In fact, if the outer-product linear-associator
has an optimal throughput ($\tau$ near one for large systems), it is possible that it could have a
memory-merit approaching one as the associator size gets large.

5.6.4. Classification Figures of Merit

For classification, a merit parameter that can be 'normalized' to produce a figure-of-merit is hard
to obtain without imposing artificial constraints. One merit measure worthy of consideration however
is the ratio of the bits needed to encode the classifiable input set to the number of bits needed to
represent the categories at the output. This is called the fan-in. The parameter is of interest
because it represents the capability of the system to react to a very large input-space when it has
stored a relatively small 'representation-space'. Indeed, this is the very essence of
classification. A classifier 'filters out' non-essential information, allowing subsequent systems to
provide for far fewer contingencies. Unfortunately, a classifier can achieve a high fan-in by
classifying all possible input-vectors into one 'category'.

One remedy is to multiply the fan-in by the storage-load of the system. A system with a large load
will have stored a maximal number of categories and so the product of the fan-in and load will be
maximized by systems that can classify a large portion of the input-space even when storing a large
number $M$ of categories. With this in mind, we consider the fan-in alone when the number of
categories is a fixed value $M$. We will derive the optimal fan-in for this number of categories and
use it to find the 'normalized' fan-in merit.


To calculate the fan-in merit $\phi$ for the linear-associator, note that the logarithm of the
classifiable space is roughly $n_i + (1 - R)\log_2 M$ by equation (5.35), where $R$ is the
redundancy. The number of bits needed to label the $M$ different categories is $\log_2 M$ bits so
the fan-in $f$ is

$$f = \frac{n_i + (1 - R)\log_2 M}{\log_2 M} = \frac{n_i}{\log_2 M} + 1 - R$$

where $R$ is the input redundancy. Note that $n_i/\log_2 M$ is the maximum redundancy that can be
facilitated by the input. To get a normalized figure of merit, we first make the constraint that the
input-space has entropy $n_i$ and the output-space, being composed of $M$ categories, has entropy
$\log_2 M$. Also note that $R \ge 1/T(W) \ge 1/T_0$, so by (5.60)

$$R \ge \frac{M\log_2 M}{H(W)} \qquad (5.71)$$

and so the largest value $f_0$ of $f$ is defined by

$$f_0 = \frac{n_i}{\log_2 M} + 1 - \frac{M\log_2 M}{H(W)} \qquad (5.72)$$

The fan-in merit $\phi$ is then

$$\phi = \frac{f}{f_0} \qquad (5.73)$$
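The fan-in quantities can be sketched numerically as follows. This Python fragment assumes the fan-in is $n_i/\log_2 M + 1 - R$, that its largest value replaces $R$ by $M\log_2 M/H(W)$, and that $H(W) \approx (1/2)N\log_2 M$ with $N = n_i n_o$. The symbol names and function names are illustrative assumptions rather than the author's notation.

```python
import math

def fan_in(n_i: float, m: float, r: float) -> float:
    """Fan-in: bits of classifiable input per bit of category label."""
    return n_i / math.log2(m) + 1.0 - r

def max_fan_in(n_i: float, n_o: float, m: float) -> float:
    """Largest fan-in, using the ASSUMED entropy H(W) ~ (1/2) n_i n_o log2 M."""
    h_w = 0.5 * n_i * n_o * math.log2(m)
    return n_i / math.log2(m) + 1.0 - m * math.log2(m) / h_w

def fan_in_merit(n_i: float, n_o: float, m: float, r: float) -> float:
    """Normalized fan-in merit: fan-in divided by its largest value."""
    return fan_in(n_i, m, r) / max_fan_in(n_i, n_o, m)

if __name__ == "__main__":
    # Large-alpha design example: M = 50,000, K = 50, R = 1 + K.
    print(fan_in(10820, 50_000, 51), fan_in_merit(10820, 1571, 50_000, 51))
```

Under the entropy approximation the subtracted term in the maximum fan-in is just the load $2M/N$, so the merit falls below one only to the extent that the redundancy $R$ exceeds the load.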


To get the merit for the non-linear associator, recall from relation (5.42) that
$N \ge \pi M\ln M$ so that

A good overall merit parameter for classification might be the product of the load, the fan-in
merit, and the inclusion merit. The issue of finding an overall figure-of-merit for memory and
classification might not be hard to address. The author has only recently defined these merit
measures and has not yet fully explored the alternatives.

In passing, we might add that these figures of merit can be quantified for the linear-associator
once the throughput of the linear version of the classifier can be determined. We conjecture that
the linear-associator may be very nearly optimal in most respects when the matrix size is large. As
far as non-linearities are concerned, any non-linearity will cause performance degradation. However,
'sigmoid' non-linearities used in so many connectionist systems (see [22, 21, 10]) will perform
reasonably well if they are not too 'steep'. In particular, if the rising portion of the sigmoid is
broad enough to encompass most of the variance of the components of the matrix-output-vector, most
of the matrix-output information will be retained. Though the author has not made the attempt, a
'maximal steepness' necessary for negligible information loss should be easy to obtain using
something like the tails-lemma of appendix A. Here, one would use the sigmoid to limit the range of
values the components can assume as was done for the Hopfield non-linearity [...]
sigmoid-non-linear outer-product associators.
Chapter 6
Summary

6.1. Contributions and Accomplishments

The most important contribution of this work is the characterization of memory and storage in terms
of information theory. For memory, the primary accomplishment was evaluation of the matrix-entropy
and the proof that it bounds the retrievable information. The bound was subsequently used to
determine the amount of information stored as a function of matrix-size and number of associations
stored. A criterion for minimal performance was obtained through the definition of channel memory.
This criterion was then used to bound the number of items storable. We also dealt with the notion of
retrieving information via separate 'accesses' to the memory, one for each item stored. Though
information obtained this way is not the same as that actually stored in the matrix, we find that
the latter is an upper bound on the former.

Use of the concept of the matrix-channel allowed us to characterize and evaluate classification by
the associator. For this, the fundamental concept defined was the matrix-throughput, which is the
ratio of the output information to the input information. The simple linear relation between the two
for the associator with Hopfield non-linearity allowed us to quantify the fraction of the
input-space that is classifiable and obtain minimal requirements on sub-features of incomplete input
vectors needed for their proper classification. We also noted requirements on the matrix-size as
they relate to the task required. We found that an associator with Hopfield non-linearity, expected
to classify inputs that are nearly orthogonal to their category-exemplars, requires 50-100 times as
many weights as does one that merely stores its prototypes. The latter system is a 'degenerate'
classifier. It can properly 'categorize' an input vector if that vector is an input-prototype. Such
a system would not be very robust in its classification of input-vectors that have a significant
number of 'bits' in error. In any event, there is obviously a tradeoff between the number of
categories over which the associator can divide the input-space and the fraction of the input-space
that can be classified. The more category discrimination required of the system, the fewer vectors
can be classified given fixed matrix-dimensionalities.

We mention that in some sense, the associator is not really doing classification unless the
output-dimensionality is very nearly equal to the logarithm of the number of categories stored. We were merely
interested in conditions under which the associator would pass through information useful to a subsequent
stage that is to determine the category to which the input to the associator belongs (see second-stage of
figure 5-3). An associator could be said to classify its inputs if the outputs it produced were much nearer to the
output-prototypes than the respective input-vectors were to their exemplar-prototypes. In the case of the
Hopfield-non-linear associator, the average distance of the matrix-output from the 'correct'
output-prototype is $n_o(1 - p_c)$. We can decrease this distance by forcing $p_c$ to be near one or by keeping $n_o$
small. The first of these can only be done by storing fewer than $n_i$ categories, where $n_i$ is the dimension
of the input-vectors (see equation (5.15), page 58). The second option is fortunately in keeping with
optimal performance of the classifier. In fact, we found earlier that a large input-dimensionality allows
classification of a very large number of vectors for a given matrix-size and storage-load. This is probably
the most important finding concerning associator-classification. A matrix that 'fans in', so that its
input-dimension is much larger than its output-dimension, will give the best classification performance for a fixed
matrix-size and number of stored categories. Thus we have an architectural specification based on
information theory. A classifier does data-compression, so that the output handles much less entropy than
does the input, and the matrix dimensionalities should reflect this fact for optimal performance.
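
As a minimal illustration of the 'fan-in' architecture described above, the following Python sketch builds an outer-product hetero-associator with a hard-limiting (sign) non-linearity. It is not the implementation analysed in this work: the +1/-1 coding, the function names `store` and `recall`, and the small set of mutually orthogonal prototypes are all assumptions chosen so that the behaviour can be checked by hand.

```python
def store(pairs):
    """Sum of outer products y x^T over the stored (x, y) prototype pairs."""
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    w = [[0] * n_in for _ in range(n_out)]
    for x, y in pairs:
        for r in range(n_out):
            for c in range(n_in):
                w[r][c] += y[r] * x[c]
    return w

def recall(w, x):
    """Matrix product followed by a hard-limiting (sign) non-linearity."""
    return [1 if sum(wi * xi for wi, xi in zip(row, x)) >= 0 else -1 for row in w]

# Hypothetical prototypes: 8-dimensional, mutually orthogonal inputs and
# 2-dimensional outputs, so the matrix 'fans in' from 8 components to 2.
pairs = [
    ([1, 1, 1, 1, 1, 1, 1, 1], [1, 1]),
    ([1, -1, 1, -1, 1, -1, 1, -1], [1, -1]),
    ([1, 1, -1, -1, 1, 1, -1, -1], [-1, 1]),
]
W = store(pairs)

noisy = list(pairs[1][0])
noisy[0] = -noisy[0]            # one input 'bit' in error

print(recall(W, pairs[1][0]))   # exact prototype     -> [1, -1]
print(recall(W, noisy))         # corrupted prototype -> [1, -1]
```

With orthogonal input-prototypes the cross-talk vanishes for exact prototypes, so retrieval is perfect; the randomly chosen prototypes treated in the analysis would instead add a noise term to each output component.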

After evaluation of the performance of the system, we obtained figures of merit for both memory
and classification performance. These were 'normalized' with respect to optimal information-theoretic
performance limits and so serve as a basis of comparison of general memory/classifier systems. The
associator with Hopfield non-linearity was shown to perform suboptimally, in fact, disappointingly so. On
the 'up side', the Hopfield-non-linear system provides a lower bound for performance of associators with
'sigmoid' type non-linearities.

6.2.  Limitations  of  this  Investigation  and  Future  Directions 

The main limitation of this work was that it did not address the information content of the actual
matrix-output (labelled G^' in figure 5-3). The problems with the analysis are mentioned on page 0<.
Once this issue is addressed, one may be able to determine the optimal performance of any associator with
sigmoid non-linearity on its output. What's more, the storage bound was merely an upper bound to
performance. Knowledge of the amount of information present in the matrix-output would determine just
how tight this bound is. We also assumed that the information at the output of the matrix is all useful to
a second-stage process that must classify the output-vectors. This is not necessarily true but is probably a
good assumption due to the fact that the associator maps similar inputs to similar outputs and the fact
that we characterized information at both input and output in terms of vector-similarity.

A rather serious shortcoming of the analysis was that it assumed that the prototype vectors were
chosen randomly, that is, they were 'balanced-Bernoulli' vectors. In reality, if a system acquires its
prototypes by encoding representations of 'stimuli' or 'concepts' etc., it will most likely have correlated
prototypes. So while we did not require orthogonality of the prototypes, the requirement that they be
uncorrelated (randomly selected) is too stringent. The problem is compounded by the fact that storage
capacity most probably degrades in the presence of inter-prototype correlation; the sensitivity to
correlation becomes more pronounced as the system-size gets large (see footnote 10). This is a serious flaw since it
indicates that the storage capacity may not be achievable in practice. On the other hand, the relation of
mutual information to vector geometry outlined in appendix B may provide a means by which a set of
prototypes can be strategically chosen so as to minimize correlation or, equivalently, maximize mutual
Hamming-distance. If such a method could be easily incorporated into the encoding process, these systems
could in fact achieve better-than-optimal performance since 'de-correlation' could produce prototypes
more mutually distant than random selection can.

Another issue not addressed was classification performance when the number of stored categories
was less than the input-dimensionality. The analysis in the classification chapter would probably extend
to this case if the linear approximation to $\Phi$ on page 50 was changed to a quadratic one for more
accuracy. Even without this change however, the linear approximation overestimates $p_c$, so the
performance bounds derived in the classification chapter apply to the case that the number of stored
categories is small. The upper bound merely becomes looser. As the number of stored categories is
diminished, $p_c$ increases but not as rapidly as the linear approximation would indicate. Note that even
when the number of categories is less than the input-dimensionality, the analysis applies to randomly
selected input-prototypes, not orthogonalized (forcefully-decorrelated) prototypes. This is an advantage
since it represents a relaxation of the orthogonality restriction needed for perfect retrieval (see [21, p. 18]).

Regarding future directions, there are too many possible avenues for continuing this work to
mention here. Two however are of primary concern to the author. First is the analysis of the
auto-associator as both memory and classifier. This extension is not without obstacles however. With respect
to memory, the weights of an outer-product matrix are less independent when the output-prototypes are
identical to the input-prototypes. On the other hand, the individual weights (excluding those on the
diagonal, which are constant and so contribute nothing to the matrix-entropy) will have the same
distribution as those of the hetero-associator and should be nearly independent when many prototypes are
stored. In any event, the matrix-entropy of the auto-associator is less than for a hetero-associator so the
storage will be limited accordingly. Another problem regards classification. An auto-associator requires
the output-dimensionality to be the same as that of the input. The present investigation indicates this
condition is suboptimal for classification performance.

Footnote 10: The evidence for this was obtained by a 'cursory' investigation conducted by the author. This analysis was not
included since it depended upon erroneous independence assumptions of vector dot-products and so may have been
inaccurate.

One method for solving both problems is to use a hetero-associator (with output-dimension smaller
than that of the input) but feed back the output information in some constructive fashion. However, even
if this can be done, the amount of output information must be sizable in comparison with the amount of
input information present at the start of the auto-association process. If the amount of output
information is less than 1/2 or 1/3 of the amount of input information, the incremental increase in
information available at the output after several 'iterations' of the auto-associator will be only
marginally better than that available to begin with. The author believes that the auto-associator will
therefore have greatly improved classification performance for light storage loads but will not gain much
storage capacity as a result of the auto-associative feedback.

We also mention that theorem 1, page 28, does not apply to the auto-associator: the
'retrieval-address' is not independent of the datum to be retrieved, since the input is generally a partial rendition of
that datum. The theorem could be modified to take this into account, but the bound on
retrievable information will be different. The auto-associator has the advantage that the input partially
specifies the output, so the auto-associator needn't 'work as hard' when the input specifies a substantial
portion of the output. The result should be improved classification-performance over the hetero-associator
even though the auto-associator has a (perhaps marginally) smaller matrix-entropy. In any event, the
author believes that the methods used to evaluate classification of 'single-feature' vectors might aid
quantification of the performance of the auto-associator.

The other direction of research to be mentioned is the storage of prototypes whose components are
zero-mean gaussians. This is a more natural mode of storage for the outer-product associator since the
output vector produced is best characterized as the proper output-prototype embedded in gaussian noise.
The author believes that the analysis would begin with the noisy-signal analysis of Gallager in [12, p. 32,
Example a] and proceed with evaluation of the matrix throughput.

Lastly, we mention that associators built from other storage rules such as error correction have not
been treated. This may be a much more difficult problem since evaluation of the matrix-entropy could be
problematic. In the event that it can be determined or approximated, the theory presented here would
then be applicable for performance evaluation. The result could be a theory relevant to multi-layer
error-correction systems such as the Parker, Rumelhart 'backpropagation' networks.

6.3. Epilog

At this point, I'd like to let my editorial hair down and relate a couple of interesting observations.
First, notice that the prototypes were treated as vectors that were to be distinguished as exemplars of
distinct categories. As such, a premium was put on their dissimilarity so that the system could tell them
apart. Though this may not be desirable in all associator tasks, it points up an issue regarding the
'symbol' view of intelligence. If we identify the stored prototypes as 'symbols', one could view symbols
as a means of performing large-scale data-compression on the environment. This not only enables a
system to vastly simplify its representation of the environment, but the identification of such symbols in a
cognitive system could subsequently provide a parsimonious theory of cognition (yes, I know, 'traditional
AI' already knows this). Not that the identification would be easy (if symbols can be said to exist at all,
they are probably too 'plastic' and malleable to be static entities), but in the associator at least, the
symbols are the prototype pairs. The input-prototype reflects the system's 'idea' of a most typical
'object type' within a large class of objects, and the output-prototype reflects the system's representation
of the object. The object, at this level, is known only as it belongs to a generic class of objects. All other
information is 'discarded' as irrelevant. The analysis done here showed data-compression as a
consequence of the presence of symbol/prototypes. However, the relation should go the other way as well,
as evidenced by studies of 'compressed', 'hidden-unit' representations generated within backpropagation
networks. The symbol is doubtfully an explicit feature of the brain, but is probably an emergent property
of data-compression.

While I'm making conjectures about how the brain works, I might as well take a stab at the amount
of information it can store. The figures obtained here are doubtfully accurate for biological brains but
serve as a prediction made by the following simplistic assumptions:

1. The whole brain participates in storing roughly N items, where N is the number of connections
in the brain.

2. The connection strengths are normally distributed with variance roughly N.

3. The effect of all connections on a neuron is the linear sum of the individual effects.

How embarrassing! Anyway, assuming 10,000 connections per neuron and $10^{10}$ to $10^{11}$ neurons per brain,
we get $10^{14}$ to $10^{15}$ for the number of connections. The information storable is then roughly $N\log_2 N$, or
about $5\times10^{15}$ to $5\times10^{16}$ bits, or roughly a billion megabytes.
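
The arithmetic behind this estimate can be checked in a few lines. The sketch below assumes that the storable information is N log2 N bits for N connections (the formula is partly illegible in the source, so this reading is an assumption) and converts bits to megabytes at 8×10^6 bits per megabyte; the function name `storable_bits` is hypothetical.

```python
import math

def storable_bits(n_connections):
    """Assumed reading of the estimate: N * log2(N) bits for N connections."""
    return n_connections * math.log2(n_connections)

connections_per_neuron = 10_000
for neurons in (1e10, 1e11):
    n = connections_per_neuron * neurons      # total number of connections N
    bits = storable_bits(n)
    megabytes = bits / 8e6                    # 8 bits per byte, 10^6 bytes per megabyte
    print(f"N = {n:.0e}: {bits:.1e} bits, about {megabytes:.1e} megabytes")
```

Under these assumptions the two ends of the range come out near 5×10^15 and 5×10^16 bits, which is of the order of a billion megabytes.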

The only thing that will rescue this estimation is its crudeness. The noteworthy thing though is that
the theory does make a prediction. It would be interesting if, in the future, a better understanding of
cognition and brain-dynamics would render better assumptions than the ones given here and, if so, how these
assumptions affect the estimate in relation to the one I've just made. I leave it to the reader to estimate
the maximum number of stimuli the brain can possibly classify. If you come up with a number (boy,
would it be big!) let me know what it is over dinner and tell me what your assumptions were. Just don't
publish it as a research finding (did you know that we only use 10-percent of our brains? . . .). Well, I've
put in my ten-percent, thank-you!

Appendix A

Entropy of a Binomial Random Variable

In this section, we show that the entropy of a binomial r.v. is approximated by the entropy of a
corresponding normal r.v. In this development, a binomial r.v. $S_n$ is a sum of n i.i.d. Bernoulli trials,
where a Bernoulli trial is an r.v. with outcome 0 or 1. We will only consider binomial sums of
balanced Bernoulli-trials, that is, Bernoulli trials whose two outcomes are equiprobable. Such a binomial
r.v. has variance n/4 and, as we will show, has entropy that approaches that of a normal r.v. of the same
variance. The entropy of a normal r.v. with variance n/4 is $(1/2)\log_2(\pi e n/2)$. Therefore the following
theorem will be proven in this appendix:

Theorem 1: Let $S_n$ be the binomial r.v. associated with the sum of n i.i.d. balanced
Bernoulli-trials. Then

$$\lim_{n \to \infty} \left( H(S_n) - (1/2)\log_2(\pi e n/2) \right) = 0 \tag{A.1}$$

The rate of convergence is not treated, but numerical tests have indicated it to be fairly rapid. It would
be of interest to study not only the rate of convergence, but whether or not the convergence is monotone
in n. That is, one would expect that

$$\left| H(S_{n+1}) - (1/2)\log_2(\pi e (n+1)/2) \right| < \left| H(S_n) - (1/2)\log_2(\pi e n/2) \right| \tag{A.2}$$

for all n = 1, 2, ....

The rate and manner of convergence are not explicitly dealt with, though they possibly could be inferred
from the proof that follows.
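
The kind of numerical test alluded to above can be sketched as follows. This is an illustrative check, not the author's original test: it computes the entropy of the balanced binomial directly from its probability mass function and compares it with the normal entropy $(1/2)\log_2(\pi e n/2)$. The function names are hypothetical.

```python
import math

def binomial_entropy(n):
    """Entropy in bits of a sum of n balanced Bernoulli (0/1) trials."""
    h = 0.0
    for k in range(n + 1):
        # natural log of C(n, k) / 2**n, via lgamma to avoid huge integers
        log_p = (math.lgamma(n + 1) - math.lgamma(k + 1)
                 - math.lgamma(n - k + 1) - n * math.log(2))
        p = math.exp(log_p)
        if p > 0.0:                      # skip terms that underflow to zero
            h -= p * log_p / math.log(2)
    return h

def normal_entropy(n):
    """Entropy in bits of a normal r.v. with variance n/4."""
    return 0.5 * math.log2(math.pi * math.e * n / 2)

def gap(n):
    """Difference H(S_n) - (1/2) log2(pi e n / 2) appearing in (A.1)."""
    return binomial_entropy(n) - normal_entropy(n)

for n in (2, 10, 100, 1000):
    print(n, binomial_entropy(n), normal_entropy(n), gap(n))
```

Printing `gap(n)` for increasing n gives a direct view of how quickly the difference in (A.1) shrinks; it says nothing rigorous about monotonicity.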

A few lemmas are needed to obtain the result. Each lemma specifies that some sequence or class of
sequences exists that ensures that a specific inequality is true. Constraints on the sequences sufficient for
the inequality to hold are specified by each lemma. After the proof of the lemmas, the proof of the main
theorem begins by showing that a sequence exists that obeys the constraints of all the lemmas
simultaneously. All the respective inequalities will then hold and they can be linked together with the
'triangle-inequality' to give the result of the theorem. Arguments used in the various proofs were
motivated by developments in Feller [11] and Rudin [39].

The proofs to follow generally require that, given an arbitrarily small real number $\epsilon > 0$, some
positive quantity that is a function of the positive integer n will be smaller than $\epsilon$ for all sufficiently
large values of n. No generality is lost by assuming that $\epsilon$ is less than 1. This assumption will be used
throughout (except where otherwise stated). Further, to simplify the arguments and notation, we consider
only even values of n. The arguments for odd n would be the same but n/2 would have to be
replaced by (n - 1)/2. Finally, the result of each lemma will hold when $\epsilon$ is replaced by $\epsilon/k$ for any fixed constant $k > 0$, since $\epsilon$ is
an arbitrary positive constant. This will be instrumental in the proof of the main result.

Notationally, $\phi_\sigma(x)$ is the normal probability-density function $\frac{1}{\sqrt{2\pi}\,\sigma}\exp(-x^2/2\sigma^2)$ for a
normal r.v. X with a mean equal to zero and variance $\sigma^2$, where $\sigma > 0$. We will be concerned with
$\sigma = \sqrt{n}/2$ and will use this value for $\sigma$ throughout. The standard normal density function
$\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$ will be denoted $\phi(x)$.

A.1. Ignoring Tails of the Normal Entropy Integral

The entropy of a normal r.v. with variance $\sigma^2$ is given by the integral
$\int_{-\infty}^{\infty} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx$. The first lemma allows approximation of the normal entropy by ignoring the
'tails' of this integral. We show that for $\sigma = \sigma(n) = \sqrt{n}/2$, a positive-integer sequence $\{r_n\}$ of order
$O(\sqrt{n\log_2(\log_2 n)})$ exists that grows rapidly enough so that for any positive $\epsilon$, the integral

$$\int_{-r_n}^{r_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx$$

is within $\epsilon$ of the true entropy for all sufficiently large n. From this it follows that if $\{s_n\}$ is a
sequence whose elements exceed those of $\{r_n\}$ for all sufficiently large n, then the integral

$$\int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx$$

will be within $\epsilon$ of the true entropy. This property we will call asymptotic convergence. In
particular, if $\{s_n\}$ is of higher order than $\{r_n\}$ then the just mentioned integral has this asymptotic
property. Our concern is to find a lower estimate of the order of $\{r_n\}$ that is sufficient to guarantee
asymptotic convergence. The following lemma and its corollary state the result.
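
To make the effect of truncating the tails concrete, the sketch below evaluates the entropy integral over $[-b\sigma, b\sigma]$ with a simple midpoint rule and compares it with the closed-form normal entropy. It is only a numerical illustration under assumed parameter choices (n = 100, a handful of values of b); the function names are hypothetical and the midpoint rule is not part of the proof.

```python
import math

def phi_sigma(x, sigma):
    """Zero-mean normal density with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)

def truncated_entropy(n, b, steps=20000):
    """Midpoint-rule value of the entropy integral over [-b*sigma, b*sigma]."""
    sigma = math.sqrt(n) / 2
    r = b * sigma
    h = 2 * r / steps
    total = 0.0
    for i in range(steps):
        p = phi_sigma(-r + (i + 0.5) * h, sigma)
        total -= p * math.log2(p) * h
    return total

def truncation_gap(n, b):
    """True normal entropy (1/2) log2(2 pi e sigma^2) minus the truncated integral."""
    sigma = math.sqrt(n) / 2
    return 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2) - truncated_entropy(n, b)

for b in (1.0, 2.0, 3.0, 6.0):
    print(b, truncation_gap(100, b))
```

The gap falls off very quickly as b = r/σ grows, which is why a sequence growing only slightly faster than σ is enough.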

Lemma 2: For each n = 1, 2, ..., let $X_n$ be a normal r.v. with variance $\sigma^2$ where
$\sigma = \sqrt{n}/2$. Given $\epsilon > 0$ there exists a positive-integer sequence $\{r_n\}$ of order
$O(\sqrt{n\log_2(\log_2 n)})$ with the following properties:

1. First property: There exists a positive integer $N_1$ such that if $n > N_1$ then

$$\left| H(X_n) - \int_{-r_n}^{r_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx \right| < \epsilon \tag{A.3}$$

2. Second property: If $\{s_n\}$ has the property that for some positive integer $N_2$,
$n > N_2$ implies $s_n \ge r_n$, then $\{s_n\}$ has the first property.

Proof: For any n the entropy of $X_n$ is defined by

$$H(X_n) = \int_{-\infty}^{\infty} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx = \lim_{t \to \infty}\int_{-t}^{t} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx \tag{A.4}$$

Since $X_n$ is normal with variance $\sigma^2$, the entropy $H(X_n)$ is equal to
$(1/2)\log_2(2\pi e\sigma^2) < \infty$ [12, p. 32]. Therefore the limit above is finite and, by definition of
the limit as $t \to \infty$, a positive integer $r_n$ exists so that $r \ge r_n$ implies equation (A.3) with $r_n$
replaced by r. We now show that for fixed $\epsilon > 0$, a positive-integer sequence $\{r_n\}$ can be
chosen as an $O(\sqrt{n\log_2(\log_2 n)})$ function of n so that property 1 holds.

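Note that $\phi_\sigma(\sigma u) = (1/\sigma)\,\phi(u)$. Substituting the variable $u = x/\sigma$ into the integral of
(A.3) and letting $b_n = r_n/\sigma$, one obtains

$$\begin{aligned}
\int_{-r_n}^{r_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx
&= \sigma\int_{-b_n}^{b_n} -\phi_\sigma(\sigma u)\log_2\phi_\sigma(\sigma u)\,du \\
&= \sigma\int_{-b_n}^{b_n} -\frac{\phi(u)}{\sigma}\log_2\bigl(\phi(u)/\sigma\bigr)\,du \\
&= \int_{-b_n}^{b_n} -\phi(u)\log_2\phi(u)\,du + \log_2\sigma\int_{-b_n}^{b_n}\phi(u)\,du
\end{aligned} \tag{A.5}$$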


We denote $\int_{-b}^{b} -\phi(u)\log_2\phi(u)\,du$ by $I_1(b)$ and denote $\int_{-b}^{b}\phi(u)\,du$ by $I_2(b)$.

If b is allowed to approach infinity, then $I_1(b)$ converges to the entropy $(1/2)\log_2(2\pi e)$
of a standard normal r.v. We can therefore choose a constant $b_0$ such that $b \ge b_0$ implies
that $I_1(b)$ is within $\epsilon/2$ of its limit. No harm is done if for convenience we take $b_0$ to be
larger than 1.

Since the lemma is concerned with the dependence of $b_n$ on n as n gets large, no
generality is lost by considering only $n > 132$ and $\epsilon < 1/4$. For such n, let (see footnote 11)

$$r_n = \left\lceil \frac{\sqrt{n}}{2}\left( \sqrt{2\log_2(4/\epsilon)}\,\bigl(\log_2(\log_2(\sqrt{n}/2))\bigr)^{1/2} + b_0 \right) \right\rceil$$

Since $n > 132$ and $\epsilon < 1$ the quantities under radicals are non-negative. Also $b_0$ is
independent of n, so that $b_n = O(\sqrt{\log_2(\log_2 n)})$. The lemma will follow if we can show for
fixed $n > 132$ that $b \ge b_n$ implies

$$\left| H(X_n) - \bigl( I_1(b) + (\log_2\sigma)\,I_2(b) \bigr) \right| < \epsilon \tag{A.6}$$

Denote $\lim_{b \to \infty} I_i(b)$ by $I_i(\infty)$, i = 1, 2. From the derivation above one can see that
$H(X_n) = I_1(\infty) + (\log_2\sigma)\,I_2(\infty)$, so that (A.6) is equivalent to

$$\left| I_1(\infty) + (\log_2\sigma)\,I_2(\infty) - \bigl( I_1(b) + (\log_2\sigma)\,I_2(b) \bigr) \right| < \epsilon \tag{A.7}$$

If we show that the conditions

1. $|I_1(\infty) - I_1(b)| < \epsilon/2$

2. $|I_2(\infty) - I_2(b)| < \epsilon/(2\log_2\sigma)$

hold for $b \ge b_n$, then the left-hand-side of (A.7) satisfies the following

$$\begin{aligned}
\left| I_1(\infty) + \log_2\sigma\,I_2(\infty) - \bigl( I_1(b) + \log_2\sigma\,I_2(b) \bigr) \right|
&= \left| I_1(\infty) - I_1(b) + \log_2\sigma\,\bigl( I_2(\infty) - I_2(b) \bigr) \right| \\
&\le |I_1(\infty) - I_1(b)| + |\log_2\sigma|\,|I_2(\infty) - I_2(b)| \\
&< \frac{\epsilon}{2} + \log_2\sigma\cdot\frac{\epsilon}{2\log_2\sigma} = \epsilon
\end{aligned} \tag{A.8}$$

and the conclusion will follow. Since $b \ge b_n \ge b_0$, condition 1 is satisfied by definition of $b_0$. We
therefore need only consider condition 2.

Footnote 11: The restriction $n > 132$ is used to diminish the chain of inequalities on the next page concerning the parameter b.
It also allows use of a sequence $\{r_n\}$ whose terms are as small as possible, though this isn't necessary to obtain a suitable
sequence.


To show condition 2 is satisfied, we observe that if $\Phi(z)$ is the standard normal
distribution function then we have [11, vol. 1, p. 178]

$$1 - \Phi(z) < \frac{1}{\sqrt{2\pi}\,z}\exp(-z^2/2) \quad \text{for all } z > 0 \tag{A.9}$$

Also for $b \in R$, $\Phi(-b) = 1 - \Phi(b)$ so that

$$I_2(b) = \Phi(b) - \Phi(-b) = 2\Phi(b) - 1 \tag{A.10}$$

and

$$I_2(\infty) = \lim_{b \to \infty}\int_{-b}^{b}\phi(u)\,du = 1 \tag{A.11}$$

This gives

$$|I_2(\infty) - I_2(b)| = |1 - (2\Phi(b) - 1)| = 2\,|1 - \Phi(b)| = 2(1 - \Phi(b)) \tag{A.12}$$
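
The tail bound (A.9) and the identity (A.12) are easy to check numerically. The sketch below is only an illustration: it uses the complementary error function from Python's `math` module to evaluate the normal distribution function, and the function names are hypothetical.

```python
import math

def upper_tail(z):
    """1 - Phi(z) for the standard normal distribution function Phi."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def tail_bound(z):
    """Right-hand side of (A.9): exp(-z^2/2) / (sqrt(2*pi) * z), for z > 0."""
    return math.exp(-z * z / 2) / (math.sqrt(2 * math.pi) * z)

def i2_gap(b):
    """I_2(inf) - I_2(b) = 1 - (Phi(b) - Phi(-b)); by (A.12) this is 2(1 - Phi(b))."""
    return 1.0 - (Phi(b) - Phi(-b))

for z in (0.5, 1.0, 2.0, 4.0):
    print(z, upper_tail(z), tail_bound(z), i2_gap(z))
```

For each z the second printed column stays below the third, and the fourth column is twice the second, as (A.9) and (A.12) state.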

We make the observation that the inequality $x + y < x \cdot y$ is satisfied for all $x > 4$ if
$y > 4/3$ (it is equivalent to $y > x/(x-1)$, and $x/(x-1) < 4/3$ when $x > 4$). Identifying x with $\log_2(4/\epsilon)$ and y with $\log_2(\log_2\sigma)$, we see that under the
assumptions for n and $\epsilon$ that have been made on the previous page, we have $x > 4$ and
$y > 4/3$. For $b \ge b_n$, we have the following chain of inequalities:

$$\begin{aligned}
b \ge b_n &\ge \sqrt{2\log_2(4/\epsilon)}\,\sqrt{\log_2(\log_2\sigma)} + b_0 \\
&> \sqrt{2\log_2(4/\epsilon) + 2\log_2(\log_2\sigma)} \\
&= \sqrt{2\log_2\bigl((4/\epsilon)\log_2\sigma\bigr)}
\end{aligned}$$

Therefore $-b^2/2 < -\log_2\bigl((4/\epsilon)\log_2\sigma\bigr)$ so that $\exp(-b^2/2) < \epsilon/(4\log_2\sigma)$ (see footnote 12). Now $b \ge b_n > 1$ (by
choice of $b_0 > 1$) and we have by (A.9)

$$2(1 - \Phi(b)) < \frac{2}{b\sqrt{2\pi}}\exp(-b^2/2) < 2\exp(-b^2/2) < \frac{2\epsilon}{4\log_2\sigma} = \frac{\epsilon}{2\log_2\sigma}$$

Using (A.12) this gives condition 2.

To finish the proof, we note that $\{r_n\}$ as defined is $O(\sqrt{n\log_2(\log_2 n)})$. We have finished
showing that the first property of $\{r_n\}$ holds for $N_1 = 132$.

If $\{s_n\}$ is a sequence and $N_2$ a positive integer such that $n > N_2$ implies $s_n \ge r_n$,
then set $N = \max\{N_1, N_2\}$. Since $-\phi_\sigma(x)\log_2\phi_\sigma(x) \ge 0$ for all x, it follows that for
$n > N$

$$\begin{aligned}
H(X_n) &= \int_{-\infty}^{\infty} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx \\
&\ge \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx \\
&\ge \int_{-r_n}^{r_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx
\end{aligned}$$

so that

$$\left| H(X_n) - \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx \right| \le \left| H(X_n) - \int_{-r_n}^{r_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx \right| < \epsilon$$

From this we see that $\{s_n\}$ has property 1 mentioned in the statement of the lemma.

Footnote 12: We get away with freely intermixing base-2 and natural-base logarithms due to the use of the inequality. That is, for
$z < 1$, we have that $y \le \log_2 z$ implies $\exp(y) \le \exp((\log_2 e)\ln z) = z^{\log_2 e} < z$. In this case, $y = -b^2/2$ and
$z = \epsilon/(4\log_2\sigma)$.


The following result is immediate.

Corollary: If $\{s_n\}$ is a sequence of order larger than $O(\sqrt{n\log_2(\log_2 n)})$, then there is a
positive integer N such that $n > N$ implies (see footnote 13)

$$\left| H(X_n) - \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx \right| < \epsilon$$


A.2. Discretization of the Normal Entropy Integral

The statement and proof of the next lemma use notation borrowed from Rudin [39, Ch. 6] in his
development of the Riemann-Stieltjes integral. The argument he gives in theorem 6.8 [39, p. 125] for the
integrability of a continuous function on a closed interval is extended to our situation. We desire to
approximate an integral with a Riemann-sum; however, the limits of integration are not fixed and the
integrand varies with the number of points on which we sum. Our notation, which is only slightly
different from Rudin's, is as follows. If $b > 0$ then a partition P of the closed interval $[-b, b]$ is a
finite set of points such that $-b = x_{-r} < x_{-r+1} < \ldots < x_r = b$. If $f(x)$ is a continuous
function defined over $[-b, b]$, its maximum and minimum are attained over any closed interval in the
domain of f, so we put $M_i^f = \max_{[x_i, x_{i+1}]} f(x)$ and $m_i^f = \min_{[x_i, x_{i+1}]} f(x)$, for $i = -r, -r+1, \ldots, r-1$. The
quantities $U_b(P, f)$ and $L_b(P, f)$ will denote the sums

$$U_b(P, f) = \sum_{i=-r}^{r-1} M_i^f\,(x_{i+1} - x_i) \quad\text{and}\quad L_b(P, f) = \sum_{i=-r}^{r-1} m_i^f\,(x_{i+1} - x_i)$$

If $\inf_P U_b(P, f)$ and $\sup_P L_b(P, f)$ are finite and have the same value, their common value is called the
Riemann-Stieltjes integral of f over $[-b, b]$, denoted by $\int_{-b}^{b} f(x)\,dx$. From the definition of the
integral just given, it is apparent that for any fixed P

$$L_b(P, f) \le \int_{-b}^{b} f(x)\,dx \le U_b(P, f)$$

Also the same bounds apply to the sum $\sum_{i=-r}^{r-1} f(x_i)(x_{i+1} - x_i)$ since $m_i^f \le f(x_i) \le M_i^f$ for
$i = -r, -r+1, \ldots, r-1$.

Footnote 13: By 'larger than $O(f(n))$' where $f(n) > 0$, we mean a sequence $\{s_n\}$ such that for any constant $C > 0$ there is
an N so that $n > N$ implies $s_n > C f(n)$.

Before proceeding to the lemma we state the following propositions.

Proposition 3: For any $\sigma > 0$ the functions $\phi$ and $f(x) = -\phi(x)\log_2\phi(x)$ have
bounded first-derivatives over the domain R.

One can show that both $|f'(x)|$ and $|\phi'(x)|$ are continuous over R and approach zero as $|x| \to \infty$.
These together imply boundedness over R. The second proposition is

Proposition 4: Let g be a function differentiable over a connected domain $D \subset R$
and let B be a positive constant so that the derivative $g'$ satisfies $|g'(x)| \le B$ over D.
Then g is uniformly continuous on D with $|g(x) - g(y)| \le B\,|x - y|$ for all $x, y \in D$.

Proof: Because g is differentiable, it is continuous and so integrable over finite
intervals. Taking $x \ge y$ without loss of generality, we have the following inequalities

$$|g(x) - g(y)| = \left| \int_y^x g'(u)\,du \right| \le \int_y^x |g'(u)|\,du \le B\,|x - y|$$

yielding the desired result.


We now state and prove

Lemma 5: Let $\sigma = \sqrt{n}/2$ and let $\{r_n\}$ be a sequence of positive integers such that
$b(n) = r_n/\sigma$ is $o(\sqrt{n}/\log_2 n)$. Given $\epsilon > 0$, there exists a positive integer N such that
$n > N$ implies

$$\left| \int_{-r_n}^{r_n} -\phi_\sigma(x)\log_2\phi_\sigma(x)\,dx - \sum_{i=-r_n}^{r_n} -\phi_\sigma(i)\log_2\phi_\sigma(i) \right| < \epsilon \tag{A.13}$$


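Proof: We continue to use $f(x) = -\phi(x)\log_2\phi(x)$. As shown in the previous lemma,
the integral in equation (A.13) is the sum of $I_1(b(n))$ and $\log_2\sigma \cdot I_2(b(n))$, where the functions $I_1$
and $I_2$ were defined on page 80. In a similar fashion, using $\phi_\sigma(i) = \phi(i/\sigma)/\sigma$, one has

$$\sum_{i=-r_n}^{r_n} -\phi_\sigma(i)\log_2\phi_\sigma(i) = \frac{1}{\sigma}\sum_{i=-r_n}^{r_n} f(i/\sigma) + \frac{\log_2\sigma}{\sigma}\sum_{i=-r_n}^{r_n}\phi(i/\sigma) \tag{A.14}$$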


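Let $S_1(n)$ and $S_2(n)$ denote the first and second sums on the right hand side respectively (without their prefactors). The
lemma will follow if we can find an N so that $n > N$ implies

$$\left| I_1(b) - \frac{1}{\sigma}S_1(n) \right| < \frac{\epsilon}{2} \tag{A.15}$$

$$\log_2\sigma\,\left| I_2(b) - \frac{1}{\sigma}S_2(n) \right| < \frac{\epsilon}{2} \tag{A.16}$$

To obtain this, we will require that N be large enough so that

$$\frac{1}{\sigma}f(r_n/\sigma) < \frac{\epsilon}{4} \tag{A.17}$$

$$\frac{1}{\sigma}(\log_2\sigma)\,\phi(r_n/\sigma) < \frac{\epsilon}{4} \tag{A.18}$$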


for all $n > N$. From proposition 3, we have the numbers $B_f = \max_R |f'(x)|$ and
$B_\phi = \max_R |\phi'(x)|$. Let $N_1$, $N_2$ be integers such that

1. $N_1 > \dfrac{(16\,B_f\,b(N_1))^2}{\epsilon^2}$

2. $N_2 > \dfrac{(16\,B_\phi\,b(N_2))^2}{\epsilon^2}\,\bigl(\log_2(\sqrt{N_2}/2)\bigr)^2$

and so that all $n > N_i$ satisfy each of these when substituted for $N_i$, i = 1, 2. We also
require that $N_1$ is large enough that $n > N_1$ implies relation (A.17) and $N_2$ is large
enough that $n > N_2$ implies relation (A.18). Such numbers $N_1$, $N_2$ exist since $(b(n))^2$ and
$(b(n)\log_2(\sqrt{n}/2))^2$ are $o(n)$ and the left-hand-sides of (A.17), (A.18) are $o(1)$.

Fix $n > \max\{N_1, N_2\}$ and for notational convenience let $r = r_n$ and $b = b(n)$. Let $P = \{x_i\}$ be the partition of $[-b, b]$ with $x_i = i/\sigma$, $i = -r, -(r-1), \ldots, r$ (remember $r = b\sigma$ by definition of $b$). Notice $x_{i+1} - x_i = 1/\sigma = 2/\sqrt{n}$. To show (A.15), we use the fact that $n > N_1$. Now $M_{f,i} - m_{f,i} = f(x) - f(y)$ for some $x, y \in [x_i, x_{i+1}]$, and we have $|x - y| \le 2/\sqrt{n}$. From this one obtains $M_{f,i} - m_{f,i} \le B_1 \cdot 2/\sqrt{n}$ by proposition 4. Since $n > N_1$, $n$ satisfies item 1 above so that $2/\sqrt{n} < \epsilon/(8 B_1 b)$ and we can write

$$M_{f,i} - m_{f,i} < \frac{\epsilon}{8b}$$

From this it follows that

$$U_b(P, f) - L_b(P, f) = \sum_{i=-r}^{r-1} (M_{f,i} - m_{f,i})(x_{i+1} - x_i) < \frac{\epsilon}{8b}\cdot 2b = \frac{\epsilon}{4}$$

Also

$$\frac{1}{\sigma} S_1(n) = \frac{1}{\sigma}\sum_{i=-r}^{r-1} f(x_i) + \frac{1}{\sigma} f(r/\sigma) = Q_1(n) + \frac{1}{\sigma} f(r/\sigma)$$

where $Q_1(n)$ is the sum $\sum_{i=-r}^{r-1} f(x_i)(x_{i+1} - x_i)$. Note that $Q_1(n)$ is bounded above and below by $U_b(P, f)$ and $L_b(P, f)$ respectively (by definition of these two latter quantities). By definition of the integral, $I_1(b)$ is bounded above and below by these same quantities. It follows that $|I_1(b) - Q_1(n)| < \epsilon/4$. From this and relation (A.17), we have that $(1/\sigma) S_1(n)$ is within $\epsilon/2$ of $I_1(b)$ so that (A.15) holds.


The argument that equation (A.16) holds is similar. In this case, recall that $n > N_2$ so that item 2 holds. Using the notation for the function $\phi$ analogous to that we used for $f$, we have

$$M_{\phi,i} - m_{\phi,i} \le B_2 \cdot \frac{2}{\sqrt{n}} < \frac{\epsilon}{8b\,\log_2 (\sqrt{n}/2)}$$

and

$$U_b(P, \phi) - L_b(P, \phi) = \sum_{i=-r}^{r-1} (M_{\phi,i} - m_{\phi,i})(x_{i+1} - x_i) < \frac{\epsilon}{8b\,\log_2 (\sqrt{n}/2)}\cdot 2b = \frac{\epsilon}{4\log_2 (\sqrt{n}/2)}$$

Finally, let $Q_2(n) = \sum_{i=-r}^{r-1} \phi(x_i)(x_{i+1} - x_i)$ and notice that $Q_2(n)$ is bounded above and below by $U_b(P, \phi)$ and $L_b(P, \phi)$ respectively, as is $I_2(b)$. Therefore we have that $|I_2(b) - Q_2(n)| < \epsilon/(4\log_2 \sigma)$. The identity $(1/\sigma) S_2(n) = Q_2(n) + (1/\sigma)\,\phi(r/\sigma)$ and relation (A.18) then imply the inequality (A.16). The lemma follows with $N = \max\{N_1, N_2\}$.
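The content of Lemma 5 can be illustrated numerically. The following is a minimal sketch, not part of the proof: the function names, the use of a composite Simpson rule for the integral, and the particular choice $r_n = \lfloor\sqrt{2n\log_2 n}\rfloor$ are assumptions made only for this example.

```python
import math

def phi_sigma(x, sigma):
    """Normal density with mean 0 and standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def entropy_integrand(x, sigma):
    """The function -phi_sigma(x) * log2(phi_sigma(x))."""
    p = phi_sigma(x, sigma)
    return -p * math.log2(p)

def truncated_integral(r, sigma, steps_per_unit=50):
    """Composite Simpson approximation of the integral over [-r, r]."""
    m = 2 * r * steps_per_unit          # even number of subintervals
    h = 2 * r / m
    total = entropy_integrand(-r, sigma) + entropy_integrand(r, sigma)
    for j in range(1, m):
        weight = 4 if j % 2 else 2
        total += weight * entropy_integrand(-r + j * h, sigma)
    return total * h / 3

def truncated_sum(r, sigma):
    """The discretized sum over the integers i = -r, ..., r."""
    return sum(entropy_integrand(i, sigma) for i in range(-r, r + 1))

def lemma5_gap(n):
    """Absolute difference between the integral and the sum in (A.13)."""
    sigma = math.sqrt(n) / 2
    r = math.floor(math.sqrt(2 * n * math.log2(n)))  # assumed choice of r_n
    return abs(truncated_integral(r, sigma) - truncated_sum(r, sigma))

if __name__ == "__main__":
    for n in (100, 400):
        print(n, lemma5_gap(n))
```

Under these assumptions the gap is already very small for moderate $n$, which is consistent with the lemma; the proof above is what guarantees it for all sufficiently large $n$.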

A.3. Approximation of Binomial Entropy


A.3.1. Error Bounds for Logarithm Terms

Feller's development [11, vol. 1, p. 179-182] is expanded here for the sake of providing approximations to terms of the binomial probability function and bounds on the error of approximation. First, a few observations with respect to logarithm approximation. We start with the Taylor series for $\ln(1 + t)$, which is known to be


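$$\ln(1 + t) = \sum_{i=0}^{\infty} \frac{(-1)^i\, t^{i+1}}{i + 1}, \qquad 0 < |t| < 1 \qquad \text{(A.19)}$$

and for $\ln(1 - t)$ it is

$$-\ln(1 - t) = \sum_{i=0}^{\infty} \frac{t^{i+1}}{i + 1}, \qquad 0 < |t| < 1 \qquad \text{(A.20)}$$

so that

$$\ln\frac{1 + t}{1 - t} = \ln(1 + t) - \ln(1 - t) = 2\sum_{i=0}^{\infty} \frac{t^{2i+1}}{2i + 1}, \qquad 0 < |t| < 1 \qquad \text{(A.21)}$$

is obtained by adding the two series in (A.19) and (A.20). See [11, vol. 1, p. 61] for details of the derivation. Subtracting $2t$ from both sides of (A.21) gives

$$\ln\frac{1 + t}{1 - t} - 2t = 2t^3 \sum_{i=0}^{\infty} \frac{t^{2i}}{2i + 3}, \qquad 0 < |t| < 1 \qquad \text{(A.22)}$$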

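We are interested only in values of $t$ between 0 and 1/3 so that the series in (A.22) is positive. In other words $\ln\frac{1+t}{1-t} - 2t$ is positive. Comparing this with a geometric series with $t = 1/3$, we have the chain of inequalities

$$2t^3 \sum_{i=0}^{\infty} \frac{t^{2i}}{2i + 3} \le \frac{2t^3}{3}\sum_{i=0}^{\infty} t^{2i} \le \frac{2t^3}{3}\sum_{i=0}^{\infty} \left(\frac{1}{9}\right)^{i} = \frac{2t^3}{3}\cdot\frac{1}{1 - 1/9} = \frac{3t^3}{4}$$

Since the series $\sum_{i \ge 0} t^{2i}/(2i+3)$ in (A.22) contains only positive terms and the first term is 1/3, we also have

$$\ln\frac{1 + t}{1 - t} - 2t > \frac{2t^3}{3}$$

Putting these inequalities together we have

$$\frac{2t^3}{3} < \ln\frac{1 + t}{1 - t} - 2t < \frac{3t^3}{4} \qquad \text{when } 0 < t < 1/3 \qquad \text{(A.23)}$$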


Similarly we can evaluate $\ln(1 + t) - t$ for $t$ in the stated range. Subtraction of $t$ from the series (A.19) yields

$$\ln(1 + t) - t = -t^2 \sum_{i=0}^{\infty} \frac{(-t)^i}{i + 2}$$

The series is absolutely convergent over the range of $t$ considered.¹⁴ One can therefore consider the terms of the series in any order without altering the sum [30, p. 78]. We group the terms of the summation in pairs to get

$$\ln(1 + t) - t = -t^2 \sum_{i=0}^{\infty} t^{2i}\left(\frac{1}{2i + 2} - \frac{t}{2i + 3}\right)$$

Since the terms of the sum are positive, $\ln(1 + t) - t$ is negative. To assess its magnitude calculate

$$\left| t^2 \sum_{i=0}^{\infty} \frac{(-t)^i}{i + 2} \right| \le t^2 \sum_{i=0}^{\infty} \left|\frac{t^i}{i + 2}\right| < t^2 \sum_{i=0}^{\infty} t^i = \frac{t^2}{1 - t}$$

Since $0 < t < 1/3$, we have $\frac{1}{1-t} < 3/2$ and so

$$|\ln(1 + t) - t| < \frac{3t^2}{2}$$

and therefore

$$-\frac{3t^2}{2} < \ln(1 + t) - t < 0, \qquad 0 < t < 1/3 \qquad \text{(A.24)}$$

¹⁴ A series is said to be absolutely convergent if and only if it converges when each of its terms is replaced by its absolute value.
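The two logarithm bounds (A.23) and (A.24) are easy to spot-check numerically. The snippet below is only an illustrative sketch; the function name `check_log_bounds` and the grid of test points are assumptions of this example. It uses `math.log1p` to limit rounding error for small $t$.

```python
import math

def check_log_bounds(t):
    """Return True if t satisfies both (A.23) and (A.24); expects 0 < t < 1/3."""
    # ln((1+t)/(1-t)) - 2t, computed with log1p for accuracy
    lhs = math.log1p(t) - math.log1p(-t) - 2 * t
    a23 = 2 * t**3 / 3 < lhs < 3 * t**3 / 4
    # ln(1+t) - t
    rhs = math.log1p(t) - t
    a24 = -3 * t**2 / 2 < rhs < 0
    return a23 and a24

if __name__ == "__main__":
    grid = [k / 100 for k in range(1, 34)]   # t = 0.01, ..., 0.33
    print(all(check_log_bounds(t) for t in grid))
```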
A.3.2. Expansion of Binomial Coefficients

These observations made, one can now follow the development of [11, vol. 1, Ch. VII.2], who derives an approximation to the 'central' binomial coefficients. We will take $n$ to be even throughout and set $\nu$ to be $n/2$ to simplify notation. The case for $n$ odd would be treated similarly with $\nu = (n - 1)/2$.

Let $a_k = 2^{-n}\binom{n}{\nu + k}$ be the probability that the binomial sum $S_n$ exceeds the mean, $n/2$, by $k$. Since $a_{-k}$ equals $a_k$, we will only consider non-negative integers $k$. Our goal is the analysis of the error incurred when $a_k$ is approximated by the normal density of variance $\nu/2 = n/4$.

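It is easy enough to verify that

$$a_k = a_0\,\frac{\nu(\nu - 1)\cdots(\nu - (k - 1))}{(\nu + 1)(\nu + 2)\cdots(\nu + k)} \qquad \text{(A.25)}$$

There are $k$ terms in the numerator and in the denominator so we may divide each term by $\nu$ without changing the value of the fraction

$$a_k = a_0 \prod_{j=0}^{k-1}\left(1 - \frac{j}{\nu}\right) \Big/ \prod_{j=1}^{k}\left(1 + \frac{j}{\nu}\right) \qquad \text{(A.26)}$$

For $k < \nu/3$ and $|j| \le k$ we use the approximation $1 + j/\nu \approx \exp(j/\nu)$ to transform the product in (A.26) into

$$a_k \approx a_0 \exp\left(-\frac{2}{\nu}\sum_{j=1}^{k-1} j - \frac{k}{\nu}\right)$$

Using the fact that $\sum_{j=1}^{k-1} j = k(k - 1)/2$ one has

$$a_k \approx a_0 \exp(-k^2/\nu) \qquad \text{(A.27)}$$

Using Stirling's formula to approximate factorials, the term $a_0 = 2^{-n}\binom{n}{\nu}$ is approximately $\sqrt{2/(\pi n)}$ and we obtain the normal-density approximation to the binomial coefficient $a_k$

$$a_k \approx \sqrt{\frac{2}{\pi n}}\,\exp(-k^2/\nu) \qquad \text{(A.28)}$$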


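Notice that the right-hand side of this equation is the normal probability-density function of an r.v. $N$ with variance $\sigma^2 = \nu/2$, evaluated at $k/\sigma$ standard deviations from the mean. Allowing $\epsilon_1$ and $\epsilon_2$ to represent the errors occurring in the approximation (A.27) and in that of $a_0$ respectively, put

$$a_k = a_0 \exp(-k^2/\nu)\exp(-\epsilon_1) \qquad \text{(A.29)}$$

$$a_0 = \sqrt{\frac{2}{\pi n}}\,\exp(\epsilon_2) \qquad \text{(A.30)}$$

so that¹⁵

$$a_k = \sqrt{\frac{2}{\pi n}}\,\exp(-k^2/\nu)\exp(-(\epsilon_1 - \epsilon_2)) \qquad \text{(A.31)}$$

This defines $\epsilon_1$ and $\epsilon_2$, and the relation

$$\exp(-k^2/\nu)\exp(-\epsilon_1) = \prod_{j=1}^{k-1}\left(1 - \frac{j}{\nu}\right) \Big/ \left[\left(1 + \frac{k}{\nu}\right)\prod_{j=1}^{k-1}\left(1 + \frac{j}{\nu}\right)\right]$$

is obtained from equations (A.26) and (A.29). Taking logarithms of both sides

$$-\frac{k^2}{\nu} - \epsilon_1 = \sum_{j=1}^{k-1} \ln\frac{1 - j/\nu}{1 + j/\nu} - \ln(1 + k/\nu)$$

Using the fact that $k^2/\nu = 2\sum_{j=1}^{k-1} j/\nu + k/\nu$ we solve for $\epsilon_1$

$$\epsilon_1 = \sum_{j=1}^{k-1}\left[\ln\frac{1 + j/\nu}{1 - j/\nu} - \frac{2j}{\nu}\right] + \left[\ln\left(1 + \frac{k}{\nu}\right) - \frac{k}{\nu}\right] \qquad \text{(A.33)}$$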
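Since (A.33) is an exact identity, it can be checked against the binomial coefficients directly. The sketch below is an illustration under stated assumptions: the helper names `a`, `eps1_from_series` and `eps1_exact` are hypothetical, $n$ is taken even, and $k < \nu$ so that every logarithm is defined.

```python
import math

def a(n, k):
    """a_k = 2^(-n) * C(n, n/2 + k) for even n."""
    return math.comb(n, n // 2 + k) / 2**n

def eps1_from_series(n, k):
    """epsilon_1 evaluated from the right-hand side of (A.33)."""
    nu = n / 2
    total = sum(math.log((1 + j / nu) / (1 - j / nu)) - 2 * j / nu
                for j in range(1, k))
    return total + math.log(1 + k / nu) - k / nu

def eps1_exact(n, k):
    """epsilon_1 recovered from its definition a_k = a_0 exp(-k^2/nu) exp(-eps1)."""
    nu = n / 2
    return -math.log(a(n, k) / a(n, 0)) - k * k / nu

if __name__ == "__main__":
    n, k = 400, 30
    print(eps1_from_series(n, k), eps1_exact(n, k))
```

The two values should agree up to floating-point rounding, which gives some confidence in the signs and index ranges of the series.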


A.3.3. Upper Bound on Binomial Tail Coefficients


We are ready to state and prove


Proposition 6: For integers $\nu = n/2$ and $k$ in the range $\lceil\sqrt{7n}\,\rceil \le k \le n/6$, the relation $a_k < a_0 \exp(-k^2/\nu)$ holds.


¹⁵ Here Feller omits the leading sign in the error exponent by setting $a_k = \sqrt{2/(\pi n)}\,\exp(-k^2/\nu)\exp(\epsilon_1 - \epsilon_2)$.
Proof: The observations made in the previous section now come into play. By hypothesis, we have that $k \le n/6$ so $k/\nu \le 1/3$. We substitute $t = j/\nu$ into equations (A.22) and (A.23) and see that the terms of the sum in equation (A.33) are positive with the $j$-th term less than $3(j/\nu)^3/4$. Since $\sum_{j=1}^{k-1} j^3 = (k(k - 1))^2/4$, this sum is less than

$$\frac{3}{4}\sum_{j=1}^{k-1} \frac{j^3}{\nu^3} = \frac{3}{16}\,\frac{(k^2 - k)^2}{\nu^3} < \frac{k^4}{4\nu^3}$$


We can get a lower bound on the term to the right of the sum in equation (A.33) by putting $t = k/\nu$ into equation (A.24). The expression in (A.24) is negative and larger than $-(3/2)(k/\nu)^2$. From equation (A.33) and these bounds, we get an upper and lower bound on $\epsilon_1$¹⁶

$$-\frac{3}{2}\left(\frac{k}{\nu}\right)^2 < \epsilon_1 < \frac{k^4}{4\nu^3} \qquad \text{(A.34)}$$

On the other hand, from equation (A.23) each term of the sum in equation (A.33) is larger than $2(j/\nu)^3/3$ so that for $k$ in the stated range the sum itself is larger than

$$\frac{2}{3}\sum_{j=1}^{k-1} \frac{j^3}{\nu^3} = \frac{1}{6}\,\frac{(k^2 - k)^2}{\nu^3} > \frac{k^4}{8\nu^3}$$

Therefore a tighter lower bound on $\epsilon_1$ is

$$\epsilon_1 > \frac{k^4}{8\nu^3} - \frac{3}{2}\left(\frac{k}{\nu}\right)^2 \qquad \text{(A.35)}$$

For $\epsilon_2$, Feller [11, vol. 1, p. 182] shows that

$$\frac{1}{4n} - \frac{1}{20n^3} < \epsilon_2 < \frac{1}{4n} + \frac{1}{360n^3} \qquad \text{(A.36)}$$

so that $0 < \epsilon_2 < 1/(3n)$ in any event. Combining this with the lower bound for $\epsilon_1$ we get

$$\epsilon_1 - \epsilon_2 > \frac{k^4}{8\nu^3} - \frac{3}{2}\left(\frac{k}{\nu}\right)^2 - \frac{1}{3n}$$

We set

$$\frac{k^4}{8\nu^3} - \frac{3}{2}\left(\frac{k}{\nu}\right)^2 - \frac{1}{3n} > 0$$

to get a sufficient condition for $\epsilon_1 - \epsilon_2$ to be positive. This condition is met for all $k \ge \sqrt{7n}$. Therefore for $k$ in the range stated in the hypothesis, we have that the term $\exp(-(\epsilon_1 - \epsilon_2))$ of equation (A.31) is less than 1. Equation (A.31) then implies that $a_k < a_0 \exp(-k^2/\nu)$ and the proposition is proved.

¹⁶ In this section, only the lower bound will be useful. The upper bound will be useful in a later section.
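Proposition 6 can be checked exactly for specific even $n$ using integer binomial coefficients. The following sketch is illustrative only; `a` and `proposition6_holds` are hypothetical helper names, and the check is meaningful only when the range $\lceil\sqrt{7n}\,\rceil \le k \le n/6$ is non-empty (roughly $n \ge 252$).

```python
import math

def a(n, k):
    """a_k = 2^(-n) * C(n, n/2 + k) for even n."""
    return math.comb(n, n // 2 + k) / 2**n

def proposition6_holds(n):
    """Check a_k < a_0 * exp(-k^2/nu) for every k in [ceil(sqrt(7n)), n/6]."""
    nu = n // 2
    a0 = a(n, 0)
    k_min = math.ceil(math.sqrt(7 * n))
    k_max = n // 6
    return all(a(n, k) < a0 * math.exp(-k * k / nu)
               for k in range(k_min, k_max + 1))

if __name__ == "__main__":
    print(proposition6_holds(400), proposition6_holds(1000))
    # For small k the inequality goes the other way, so the lower limit matters.
    print(a(400, 5) > a(400, 0) * math.exp(-25 / 200))
```

The last line shows why the hypothesis restricts $k$ from below: near the center, $\epsilon_1$ is slightly negative and $a_k$ exceeds $a_0\exp(-k^2/\nu)$.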

A.4. Ignoring Tails of the Binomial Entropy Sum

In this section, we state and prove a lemma (called, in this section, the tails lemma) that shows one can approximate the binomial entropy by summing relatively few terms of the entropy sum. The approximation approaches the entropy of $S_n$ as the total number $n$ of terms gets large.

A.4.1. Relations Used in the Proof of the Tails Lemma

Before proving the last two lemmas, a few observations are necessary. These relate to the error magnitude to be encountered in the tails lemma.

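Proposition 7: For $t$ in the range $-1/3 \le t \le 1/3$, the relation

$$|1 - \exp(-t)| \le \frac{3}{2}\,|t| \qquad \text{(A.37)}$$

holds.

Proof: This is easily seen from the inequalities obtained from the Taylor series for $\exp(-t)$

$$|1 - \exp(-t)| = \left| t\sum_{i=0}^{\infty} \frac{(-t)^i}{(i + 1)!} \right| \le |t|\sum_{i=0}^{\infty} \frac{|t|^i}{(i + 1)!} \le |t|\sum_{i=0}^{\infty} |t|^i = \frac{|t|}{1 - |t|} \le \frac{1}{1 - 1/3}\,|t| = \frac{3}{2}\,|t|$$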


One more observation must be made before proceeding to the lemma. Since $\lim_{x \to 0^+} x\log_2 x = 0$, the function $x\log_2 x$ is continuous over the closed interval $[0, 1]$ provided we define $0\log_2 0 = 0$ to be consistent with the mentioned limit. Taking derivatives ($\log_2 x = (\ln x)\log_2 e$), one can verify that the function $-x\log_2 x$ is unimodal with maximum value $e^{-1}\log_2 e$ achieved at $x = e^{-1}$. The function is continuous on the closed interval $[0, 1]$ and so is uniformly continuous in this range.
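Given $\epsilon > 0$, we seek conditions on $z$ positive such that $|z\log_2 z| < \epsilon$.

Proposition 8: Let $\epsilon > 0$ be given. Then if $z \in [0, 1]$ and $\alpha$ is any number in the range $0 < \alpha < 1$, the inequality

$$z < \left(\frac{\alpha e \epsilon}{\log_2 e}\right)^{1/(1 - \alpha)} \qquad \text{(A.38)}$$

implies that

$$|z\log_2 z| < \epsilon \qquad \text{(A.39)}$$

Proof: Given the hypothesis, (A.38), solve for $\epsilon$ to get

$$\epsilon > \frac{1}{\alpha}\cdot z^{1-\alpha}\cdot e^{-1}\cdot \log_2 e \qquad \text{(A.40)}$$

Since $z^{\alpha} \in [0, 1]$, it follows that $-z^{\alpha}\log_2 z^{\alpha} \le e^{-1}\log_2 e$. From this we have

$$|z\log_2 z| = -z\log_2 z = -z^{1-\alpha} z^{\alpha}\log_2\left[(z^{\alpha})^{1/\alpha}\right] = \frac{1}{\alpha}\, z^{1-\alpha}\left(-z^{\alpha}\log_2 z^{\alpha}\right) \le \frac{1}{\alpha}\cdot z^{1-\alpha}\cdot e^{-1}\cdot\log_2 e$$

The last expression is less than $\epsilon$ by relation (A.40) so that the proof is complete.

For our purposes $\alpha = 1/2$ can be chosen to give

$$z < \left(\frac{e\epsilon}{2\log_2 e}\right)^2 \;\Rightarrow\; |z\log_2 z| < \epsilon \qquad \text{(A.41)}$$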
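Propositions 7 and 8 are elementary enough to test on a grid. The code below is a small sketch written for illustration; the function names and the sampling scheme are assumptions of this example, not part of the original argument.

```python
import math

def prop7_holds(t):
    """Check |1 - exp(-t)| <= (3/2)|t|; intended for -1/3 <= t <= 1/3."""
    return abs(1 - math.exp(-t)) <= 1.5 * abs(t)

def prop8_threshold(eps, alpha=0.5):
    """Right-hand side of (A.38); alpha = 1/2 gives the bound in (A.41)."""
    return (alpha * math.e * eps / math.log2(math.e)) ** (1 / (1 - alpha))

def prop8_holds(eps, alpha=0.5, samples=1000):
    """Check |z log2 z| < eps for sampled z in (0, 1] below the threshold."""
    z_max = min(prop8_threshold(eps, alpha), 1.0)
    for j in range(1, samples):
        z = z_max * j / samples          # strictly below the threshold
        if abs(z * math.log2(z)) >= eps:
            return False
    return True

if __name__ == "__main__":
    print(all(prop7_holds(k / 300) for k in range(-100, 101)))
    print(prop8_holds(0.1), prop8_holds(1.0))
```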

A.4.2.  Proof  of  the  Binomial  Tails  Lemma 

We  are  now  ready  to  state  and  prove  the  tails  lemma. 

Lemma 9: Given $\epsilon > 0$, there is a sequence $\{r_n\}$ of order $O(\sqrt{n\log_2 n})$ such that

$$\left| H(S_n) - \sum_{k=-r_n}^{r_n} -a_k\log_2 a_k \right| < \epsilon \qquad \text{(A.42)}$$
Proof: For $n < 10$, we can take $r_n = n$. For $n \ge 10$ choose $r_n = \lfloor\sqrt{2n\log_2 n}\rfloor$ and notice $r_n + 1 > \sqrt{7n}$. Since $r_n$ is $O(\sqrt{n\log_2 n})$, we can choose an $N$ large enough that the following conditions hold for all $n > N$:¹⁷

1. $r_n \le n/6 - 1$

2. $n > 2\log_2 e/(e\epsilon)$

For fixed $n > N$, let $k = r_n + 1$ and write the following inequalities

$$k > \sqrt{2n\log_2 n} = \sqrt{n\log_2 n + n\log_2 n} > \sqrt{n\log_2 n + n\log_2\bigl(2\log_2 e/(e\epsilon)\bigr)}$$

so that

$$k^2 > n\log_2\bigl(2n\log_2 e/(e\epsilon)\bigr)$$

and

$$-2k^2/n < 2\log_2\bigl(e\epsilon/(2n\log_2 e)\bigr)$$

This implies

$$\exp(-2k^2/n) < \bigl(e\epsilon/(2n\log_2 e)\bigr)^2$$

Since $\sqrt{7n} \le k \le n/6$, proposition 6 implies $a_k < a_0\exp(-2k^2/n)$. Together with the fact that $a_0 < 1$ this implies for $l \ge k$:

$$a_l \le a_k < a_0\exp(-2k^2/n) < \exp(-2k^2/n) < \bigl(e\epsilon/(2n\log_2 e)\bigr)^2$$

We see that $a_l$ satisfies the hypothesis of proposition 8 with $\epsilon$ replaced by $\epsilon/n$ and therefore

$$|a_l\log_2 a_l| < \frac{\epsilon}{n}, \qquad r_n + 1 \le l \le n/2$$

is the desired upper bound on 'tail' terms of the binomial-entropy sum. We can now verify the conclusion (remember, $n$ is even)

$$\left| H(S_n) - \sum_{k=-r_n}^{r_n} -a_k\log_2 a_k \right| = \left| \sum_{k=-n/2}^{n/2} -a_k\log_2 a_k - \sum_{k=-r_n}^{r_n} -a_k\log_2 a_k \right| = \left| 2\sum_{k=r_n+1}^{n/2} -a_k\log_2 a_k \right| \le 2\sum_{k=r_n+1}^{n/2} |a_k\log_2 a_k| < n\cdot\frac{\epsilon}{n} = \epsilon$$

The lemma is proved.

¹⁷ Notice that the second condition stipulates that the left-hand side of (A.42) will be less than any $\epsilon > 2\log_2 e/(en)$. Therefore, $2\log_2 e/(en)$ is roughly the maximum entropy lost when $S_n$ is 'approximated' by a random variable $S_n' = \min\{S_n, r_n\}$. We say 'roughly' because we have not accounted for the fact that $S_n'$ will equal $\pm r_n$ with a slightly higher probability than the probability that $S_n$ will assume these two values.

We also have the following corollary for sequences of higher order than the sequence $\{r_n\}$:

Corollary: For $\epsilon$, $\{r_n\}$ as in the lemma, let $\{s_n\}$ be a positive integer sequence such that $n \ge s_n \ge r_n$ for all $n$. Then

$$\left| H(S_n) - \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k \right| < \epsilon$$

Proof: The terms in the sum above are all positive. Since $n \ge s_n \ge r_n$, we have

$$H(S_n) \ge \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k \ge \sum_{k=-r_n}^{r_n} -a_k\log_2 a_k$$

Because the leftmost quantity in this string of inequalities is within $\epsilon$ of the rightmost quantity, the result of the corollary follows.
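The size of the discarded tail in the tails lemma can be computed exactly for a given even $n$. The sketch below is an illustration under assumptions: the helper names are hypothetical, the cut-off is the sequence $r_n = \lfloor\sqrt{2n\log_2 n}\rfloor$ used in the proof, and the comparison value $2\log_2 e/(en)$ is the rough entropy-loss figure mentioned in the footnote to the proof.

```python
import math

def a(n, k):
    """a_k = 2^(-n) * C(n, n/2 + k) for even n."""
    return math.comb(n, n // 2 + k) / 2**n

def entropy_term(p):
    """-p log2 p, with the convention 0 log2 0 = 0."""
    return -p * math.log2(p) if p > 0 else 0.0

def tail_gap(n):
    """Entropy contributed by the terms with |k| > r_n (both tails)."""
    r = n if n < 10 else math.floor(math.sqrt(2 * n * math.log2(n)))
    return 2 * sum(entropy_term(a(n, k)) for k in range(r + 1, n // 2 + 1))

def footnote_bound(n):
    """The quantity 2 log2(e) / (e n)."""
    return 2 * math.log2(math.e) / (math.e * n)

if __name__ == "__main__":
    n = 1000
    print(tail_gap(n), footnote_bound(n))
```

For $n = 1000$ the cut-off satisfies $r_n \le n/6 - 1$, so the hypotheses used in the proof apply, and the computed tail is far below the comparison value.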
A.5. Similarity of Binomial and Normal Entropy Approximations

We have 'chopped' the tails of the normal entropy integral and then discretized it to obtain a sum as a close approximation. The tails of the binomial entropy sum were also 'chopped' to obtain an approximation that is a sum of far fewer terms. We now need to show that the resulting approximations for the normal entropy and for the binomial entropy are good approximations of each other.

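Lemma 10: For $n = 1, 2, \ldots$ let $\sigma = \sqrt{n}/2$ and let $\{r_n\}$ be a positive-integer sequence in $O(\sqrt{n\log_2 n})$. Given $\epsilon > 0$, there exists a positive integer $N$ such that $n > N$ implies

$$\left| \sum_{k=-r_n}^{r_n} -a_k\log_2 a_k - \sum_{k=-r_n}^{r_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k) \right| < \epsilon \qquad \text{(A.43)}$$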


Proof: The sequence $\{r_n\}$ is in $O(\sqrt{n\log_2 n})$ so we consider the case that $r_n \ge \sqrt{3n}$ for all sufficiently large $n$.¹⁸ Also there exists a $C > 0$ such that $r_n \le C\sqrt{n\log_2 n}$ for all $n$. It follows that a positive integer $N_0$ can be chosen so that $\sqrt{3n} \le r_n \le n/6$ for all $n > N_0$. Let $n$ be in this range and put $t = \epsilon_1 - \epsilon_2$ where $\epsilon_1$, $\epsilon_2$ are defined by equations (A.29) and (A.30) as functions of the positive integer $n$ and $k = 1, 2, \ldots, n$. From these two equations we have that $a_k = \phi_\sigma(k)\exp(-t)$ and for $k = 1, 2, \ldots, r_n$ we can bound the terms of the difference (A.43):

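$$\left| -a_k\log_2 a_k - (-\phi_\sigma(k)\log_2 \phi_\sigma(k)) \right|$$

$$= \left| -\phi_\sigma(k)\exp(-t)\log_2\bigl(\phi_\sigma(k)\exp(-t)\bigr) - (-\phi_\sigma(k)\log_2 \phi_\sigma(k)) \right|$$

$$= \left| \phi_\sigma(k)(1 - \exp(-t))\log_2 \phi_\sigma(k) + \phi_\sigma(k)\, t\exp(-t)\log_2 e \right|$$

$$\le |\phi_\sigma(k)\log_2 \phi_\sigma(k)|\,|1 - \exp(-t)| + |\phi_\sigma(k)|\,|t|\,|\exp(-t)|\,|\log_2 e| \qquad \text{(A.44)}$$

We need upper bounds on the terms $|t|$ and $|1 - \exp(-t)|$. To get an upper bound on $|t|$, consider the following.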

Since $r_n \ge \sqrt{3n}$, we have $r_n^4/(4\nu^3) \ge 3r_n^2/(2\nu^2)$. For any $k \le r_n$ we get

$$\frac{3}{2}\,\frac{k^2}{\nu^2} \le \frac{3}{2}\,\frac{r_n^2}{\nu^2} \le \frac{r_n^4}{4\nu^3}, \qquad \text{also} \qquad \frac{k^4}{4\nu^3} \le \frac{r_n^4}{4\nu^3}$$

By equation (A.34) then, we have that

$$|\epsilon_1| < \frac{r_n^4}{4\nu^3}$$

Since $|\epsilon_2| < 1/(3n)$ we also have $|\epsilon_2| < r_n^4/(4\nu^3)$ and so

$$|t| = |\epsilon_1 - \epsilon_2| \le |\epsilon_1| + |\epsilon_2| < \frac{r_n^4}{2\nu^3}$$

In turn, $r_n^4/(2\nu^3)$ is at most $4C^4(\log_2 n)^2/n$ where $C$ was defined at the beginning of the proof.

¹⁸ The case that $r_n < \sqrt{3n}$ results in a smaller number of terms being summed in relation (A.43). The upper bounds for the error derived in this section would still be applicable to these terms. By summing fewer terms the total discrepancy between the two sums in (A.43) will be less, hence the case that $r_n < \sqrt{3n}$ is subsumed by the case that $r_n \ge \sqrt{3n}$.


To get a bound on $|1 - \exp(-t)|$ we take a positive integer $N_1$ so that $n > N_1$ implies that $4C^4(\log_2 n)^2/n < 1/3$. Therefore we have $|t| < 1/3$ and so $|1 - \exp(-t)| \le (3/2)|t|$ by proposition 7.

Finally, for $|t| < 1/3$, $\exp(-t)$ is bounded. Let $K$ be a constant so that $\exp(-t) < K$ for $|t| < 1/3$. Continuing the chain of inequalities in (A.44), noting that $|\phi_\sigma(k)| < 1$, we have

$$|\phi_\sigma(k)\log_2 \phi_\sigma(k)|\,|1 - \exp(-t)| + |\phi_\sigma(k)|\,|t|\,|\exp(-t)|\,|\log_2 e|$$

$$\le e^{-1}(\log_2 e)(3/2)|t| + K(\log_2 e)|t|$$

$$= \bigl((3/2)e^{-1} + K\bigr)(\log_2 e)\,|t|$$

$$< \bigl((3/2)e^{-1} + K\bigr)(\log_2 e)\,4C^4(\log_2 n)^2/n$$

$$= A(\log_2 n)^2/n$$

where $A$ is the positive constant $4\bigl((3/2)e^{-1} + K\bigr)(\log_2 e)\,C^4$. To finish the lemma consider again the left-hand side of (A.43), which is seen to satisfy
$$\left| \sum_{k=-r_n}^{r_n} -a_k\log_2 a_k - \sum_{k=-r_n}^{r_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k) \right| \le \sum_{k=-r_n}^{r_n} \left| -a_k\log_2 a_k - (-\phi_\sigma(k)\log_2 \phi_\sigma(k)) \right|$$

There are $2r_n + 1$ terms in this sum, each non-negative and less than $A(\log_2 n)^2/n$. Since $r_n \le C\sqrt{n\log_2 n}$, the sum is less than

$$\bigl[2C\sqrt{n\log_2 n} + 1\bigr]\,A(\log_2 n)^2/n$$

which is $O\bigl((\log_2 n)^{5/2}/\sqrt{n}\bigr)$. It follows that there is a positive integer $N_2$ such that if $n > N_2$ then

$$\bigl[2C\sqrt{n\log_2 n} + 1\bigr]\,A(\log_2 n)^2/n < \epsilon$$

From these inequalities, the lemma follows with $N = \max\{N_0, N_1, N_2\}$.


A.6. Proof of the Main Theorem


We  now  restate  and  then  prove  the  main  theorem. 

Theorem 11: Let $S_n$ be the binomial r.v. associated with the sum of $n$ i.i.d. balanced Bernoulli trials. Then

$$\lim_{n \to \infty} \left\{ H(S_n) - \tfrac{1}{2}\log_2(\pi e n/2) \right\} = 0 \qquad \text{(A.45)}$$

Proof: We will show that for a given $\epsilon > 0$, there exists a positive integer $N$ such that $n > N$ implies

$$\left| H(S_n) - \tfrac{1}{2}\log_2(\pi e n/2) \right| < \epsilon \qquad \text{(A.46)}$$

Lemmas 2, 5, 9, and 10 can each be restated with '$\epsilon$' replaced by '$\epsilon/4$' in their respective relations: (A.3); (A.13); (A.42); (A.43). These lemmas will still be true when modified in this way. Each lemma required a sequence that was constrained in some way to produce that particular lemma's result. Our plan is to exhibit a sequence $\{s_n\}$ that simultaneously satisfies the constraints of all four lemmas. The inequality mentioned in the conclusion of each lemma will then be true. The triangle inequality can then be used to show that the inequality (A.46) holds.
Let $\{s_n\}$ be the sequence

$$s_n = \begin{cases} n & n < 10 \\ \lfloor\sqrt{2n\log_2 n}\rfloor & \text{otherwise} \end{cases}$$

This is the sequence used in the proof of lemma 9 to render the inequality (A.42) (with '$\epsilon$' replaced by '$\epsilon/4$'). In particular, for some $N_1 > 0$

$$\left| H(S_n) - \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k \right| < \epsilon/4$$

for all $n > N_1$.

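Since $\{s_n\}$ is $O(\sqrt{n\log_2 n}) > O(\sqrt{n\log_2(\log_2 n)})$, the corollary to lemma 2 implies that there exists a positive integer $N_2$ such that for $n > N_2$ we have

$$\left| H(X_n) - \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx \right| < \epsilon/4$$

Also $s_n/\sigma = O(\sqrt{\log_2 n})$, that is, $s_n/\sigma = o(\sqrt{n}/\log_2 n)$, and by lemma 5 there exists a positive integer $N_3$ so that for $n > N_3$ we have

$$\left| \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx - \sum_{k=-s_n}^{s_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k) \right| < \epsilon/4$$

Finally, from lemma 10, we have that there is a positive integer $N_4$ so that $n > N_4$ implies¹⁹

$$\left| \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k - \sum_{k=-s_n}^{s_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k) \right| < \epsilon/4$$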

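Now let $N = \max\{N_1, N_2, N_3, N_4\}$ and consider any $n$ with $n > N$. Since the entropy of a normal r.v. with variance $n/4$ is $\tfrac{1}{2}\log_2(\pi e n/2)$, we can write

$$\left| \tfrac{1}{2}\log_2\frac{\pi e n}{2} - H(S_n) \right| = |H(X_n) - H(S_n)|$$

$$= \Bigl| \int_{-\infty}^{\infty} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx - \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx + \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx$$

$$\quad - \sum_{k=-s_n}^{s_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k) + \sum_{k=-s_n}^{s_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k)$$

$$\quad - \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k + \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k - H(S_n) \Bigr|$$

$$\le \left| \int_{-\infty}^{\infty} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx - \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx \right|$$

$$\quad + \left| \int_{-s_n}^{s_n} -\phi_\sigma(x)\log_2 \phi_\sigma(x)\,dx - \sum_{k=-s_n}^{s_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k) \right|$$

$$\quad + \left| \sum_{k=-s_n}^{s_n} -\phi_\sigma(k)\log_2 \phi_\sigma(k) - \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k \right|$$

$$\quad + \left| \sum_{k=-s_n}^{s_n} -a_k\log_2 a_k - H(S_n) \right|$$

Since each of the four absolute-value terms is less than $\epsilon/4$ by the previous lemmas, their sum is less than $\epsilon$. The theorem is proved.

¹⁹ The requirement that $s_n > \sqrt{7n}$ in lemma 9 is satisfied for $n \ge 12$. We take one of $N_1, N_2, N_3, N_4$ to be greater than 12 so that these requirements will be met for $n > \max\{N_1, N_2, N_3, N_4\}$ in what follows.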
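The statement of Theorem 11 can be observed numerically by computing the exact binomial entropy and comparing it with $\tfrac{1}{2}\log_2(\pi e n/2)$. This is only an illustrative sketch; the function names are hypothetical, and $n$ is kept at or below 1000 here so that every probability $\binom{n}{k}2^{-n}$ is representable as a floating-point number.

```python
import math

def binomial_entropy(n):
    """Exact entropy (in bits) of the sum of n balanced Bernoulli trials."""
    total = 0.0
    for k in range(n + 1):
        p = math.comb(n, k) / 2**n
        if p > 0:
            total -= p * math.log2(p)
    return total

def normal_entropy(n):
    """Entropy (in bits) of a normal r.v. with variance n/4."""
    return 0.5 * math.log2(math.pi * math.e * n / 2)

def entropy_gap(n):
    """H(S_n) - (1/2) log2(pi e n / 2), the quantity in (A.45)."""
    return binomial_entropy(n) - normal_entropy(n)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, entropy_gap(n))
```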
Appendix B

Mutual Information and Vector Geometry

In this appendix, we derive a relation between the mutual information shared by two ±1-vectors, $A$ and $B$, and their Hamming distance. The vector $A$ will be a balanced-Bernoulli vector and the vector $B$ will be chosen at random from within a neighborhood of $A$ of a given radius $\rho$. Vector $B$ will therefore provide information about $A$. We will determine the relation between the information $B$ provides and the neighborhood radius.

B.1. Relation of Neighborhood-Size to Neighborhood-Radius

Let $\mathcal{A}$ be the set of $n$-dimensional ±1-vectors, and for the moment, let $A$ and $B$ be chosen randomly from $\mathcal{A}$. We wish to know the fraction of $\mathcal{A}$ lying within a given radius $\rho$ of $A$. Toward this end, consider the ball $\mathcal{B}(\rho)$ of vectors of $\mathcal{A}$ that are within a radius $\rho$ of $A$. Since all vectors of $\mathcal{A}$ are equiprobable outcomes of $B$, we can determine the fraction of vectors lying in $\mathcal{B}(\rho)$ by determining the probability that $B$ will come from $\mathcal{B}(\rho)$. Because these vectors are chosen at random from $\mathcal{A}$, they are balanced-Bernoulli vectors. Let $X$ be the number of components of $B$ that disagree with their counterparts in $A$. The r.v. $X$ is the Hamming distance $HD(A, B)$ between $A$ and $B$. It is binomially distributed with mean $n/2$ and variance $n/4$ [28]. By the central-limit theorem, we can approximate the cumulative binomial probabilities with a normal distribution having the same mean and variance (see Lindgren [30, p. 158]).

From this we see that the probability that $B$ will lie in the ball $\mathcal{B}(\rho)$ is $P[X \le \rho]$, which can be determined by the normal distribution with mean $n/2$ and variance $n/4$. Half the vectors of $\mathcal{A}$ will lie within a distance of $n/2$ of $A$, so we consider the case that $\rho < n/2$ so that $\mathcal{B}(\rho)$ comprises less than 1/2 of $\mathcal{A}$. If we put $Z = (X - n/2)/(\sqrt{n}/2)$, then $Z$ is a standard normal r.v. and we can write

$$P[X \le \rho] = P[Z \le (\rho - n/2)/(\sqrt{n}/2)] = \Phi(-z) \qquad \text{(B.1)}$$

where $z$ is the positive number $(n/2 - \rho)/(\sqrt{n}/2)$. It is known that for $z$ positive (say $z > 3$) the approximation

$$\Phi(-z) \approx \frac{\exp(-z^2/2)}{\sqrt{2\pi}\,z} \qquad \text{(B.2)}$$

is quite accurate [11, vol. 1, p. 175].


Now suppose we want the ball $\mathcal{B}(\rho)$ to comprise $M^{-R}$ of $\mathcal{A}$, where $R \ge 1$. We put $P[X \le \rho] = M^{-R}$ in (B.1) and use the approximation (B.2) to get

$$M^{-R} = \frac{\exp(-z^2/2)}{\sqrt{2\pi}\,z} \qquad \text{(B.3)}$$

This can be rearranged to get the $z$ in the exponent in terms of the other parameters

$$z = \sqrt{2R\ln M - \ln(2\pi z^2)} \qquad \text{(B.4)}$$

which is a recursive expression in $z$. As $M$ grows, $z$ should grow slowly. For large $M$ then, the $2R\ln M$ term under the radical should dominate so that $z \approx \sqrt{2R\ln M}$. We put this value in for the $z$ under the radical in (B.4) to get

$$z \approx \sqrt{2R\ln M - \ln(4\pi R\ln M)} \qquad \text{(B.5)}$$

which is a good approximation to $z$ when $M$ is large (this can be verified by plugging the right-hand side of (B.5) in for $z$ in equation (B.3)). The value of $\rho$ is ascertained from the definition of $z$ to be

$$\rho = \frac{n}{2} - \frac{\sqrt{n}}{2}\,z = \frac{n}{2} - \frac{\sqrt{n}}{2}\sqrt{2R\ln M - \ln(4\pi R\ln M)} \qquad \text{(B.6)}$$

So a ball encompassing roughly $M^{-R}$ of $\mathcal{A}$ has the radius given above.
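The recursion (B.4), its one-step approximation (B.5), and the radius (B.6) can be compared numerically. The following sketch is illustrative and rests on assumptions: the function names are hypothetical, the fixed point of (B.4) is found by simple iteration started at $\sqrt{2R\ln M}$, and the exact normal tail $\Phi(-z)$ is evaluated with the complementary error function rather than the approximation (B.2).

```python
import math

def z_fixed_point(M, R, iterations=50):
    """Iterate (B.4): z = sqrt(2 R ln M - ln(2 pi z^2))."""
    z = math.sqrt(2 * R * math.log(M))
    for _ in range(iterations):
        z = math.sqrt(2 * R * math.log(M) - math.log(2 * math.pi * z * z))
    return z

def z_closed_form(M, R):
    """One-step approximation (B.5)."""
    x = 2 * R * math.log(M)
    return math.sqrt(x - math.log(2 * math.pi * x))

def radius(n, M, R):
    """Radius (B.6) computed from the closed-form z."""
    return n / 2 - math.sqrt(n) / 2 * z_closed_form(M, R)

def normal_tail(z):
    """Phi(-z) for a standard normal r.v."""
    return 0.5 * math.erfc(z / math.sqrt(2))

if __name__ == "__main__":
    M, R, n = 1000, 1.0, 1024
    print(z_fixed_point(M, R), z_closed_form(M, R))
    print(radius(n, M, R), normal_tail(z_closed_form(M, R)), M ** (-R))
```

With $M = 1000$ and $R = 1$ the ball fraction implied by (B.5) comes out close to the target $M^{-R}$, though not equal to it, which matches the description of (B.5) as an approximation for large $M$.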

B.2. Relation of Mutual Information to Neighborhood-Radius

Now suppose $B$ is chosen at random from the ball $\mathcal{B}(\rho)$ rather than from the whole set $\mathcal{A}$. An observer of $B$ can infer that $A$ lies in a radius $\rho$ of $B$. This radius is such that a neighborhood (or ball) about $B$ comprises $M^{-R}$ of $\mathcal{A}$. Knowledge of $B$ therefore constitutes an $M^{R}$-fold decrease in the possible values of $A$. Therefore the information $B$ provides about $A$ is $\log_2 M^{R} = R\log_2 M$ bits.


With regard to the $n_I$-dimensional input-vectors of an associator, the vector $A$ represents an input-prototype and $B$ represents the associator-input $F_k'$ chosen from $\mathcal{B}_k(\rho)$ (see the chapter on classification, page 55). The minimum value of $R$ allowed in this case is $\pi M/(2n_O)$ where $n_O$ is the dimension of the associator-output and $M$ is the number of stored associations. Plugging this in for $R$ in (B.6) gives an upper bound for $\rho$

$$\rho < \frac{n_I}{2} - \frac{\sqrt{n_I}}{2}\sqrt{\pi M\ln M/n_O - \ln(2\pi^2 M\ln M/n_O)} \qquad \text{(B.7)}$$


If we examine the $n_O$-dimensional output-vectors on the other hand, the vector $A$ represents the output-prototype and $B$ is the associator-output $G_k''$. We want a classifier sampling $B$ to be able to categorize it with $A$ on the basis of $B$'s distance from $A$ (see figure 5-3, page 61). It is the maximal distance $\rho$ that $B$ can be from $A$ that must be determined. To find this maximal distance, recall that the minimal information that $B$ must provide about $A$ in this case is $\log_2 M$ bits. We can substitute the value 1 for $R$ in equation (B.6) to get an upper bound for the distance that $B$ can be from $A$. The bound is

$$\rho < \frac{n_O}{2} - \frac{\sqrt{n_O}}{2}\sqrt{2\ln M - \ln(4\pi\ln M)} \qquad \text{(B.8)}$$


There is a problem, however. In this case, each ball about an output-prototype, of the radius on the right-hand side of (B.8), encompasses $1/M$ of the total number of possible $n_O$-dimensional output-vectors. This means that each prototype has a $1/M$ chance of lying in the ball about $A$. Since there are $M - 1$ output-prototypes aside from $A$ itself, we would expect one of them (on average) to lie in the ball about $A$. We call this a collision. In the case of a collision of two output-prototypes, the ball about one prototype would largely overlap with the ball about the other. Many of the vectors within $\rho$ of one of the prototypes would not get classified with that prototype. This problem exists for all the output-prototypes. That is, each prototype will have a collision with an average of one other when $\rho$ is given by the right-hand side of (B.8).

To remedy the problem, we make the radius, $\rho$, small enough so that each ball contains only $M^{-2}$ of the output-space. Now any two output-prototypes have a $1/M^2$ chance of collision with each other. Since there are roughly $M^2/2$ possible pairs of output-prototypes, less than one such pair on average will suffer from collision. If the associator produces $B$ to lie within this smaller neighborhood of $A$, then $A$ will be reliably classifiable. Since the ball constitutes $M^{-2}$ of the output space, we put $R = 2$ in (B.6) to get

$$\rho \ge \frac{n_O}{2} - \frac{\sqrt{n_O}}{2}\sqrt{4\ln M - \ln(8\pi\ln M)} \qquad \text{(B.9)}$$


This is shown as a lower bound on p since it is sufficient but not necessary for proper performance. In
other words, some values of p intermediate between that of relation (B.9) and relation (B.8) should be
workable. In fact, using

$$p = \text{[right-hand side missing in source]} \tag{B.10}$$

would result in $O(\sqrt{M})$ collisions among the M output-prototypes so that a vanishingly small fraction
of the prototypes represent 'degenerate' categories. We conclude then, that large systems having stored
a correspondingly large number of prototypes should be able to operate nearly optimally. That is, an
output-vector, B, will be constrained to lie within p of its output-prototype A, where p nears the
upper-bound in (B.8) as M gets large. On the other hand, for smaller M we may need a redundancy at
the input that is 1.5 to 2 times the minimal *M/(2nQ). This assures the output information is
(3/2)log2 M to 2log2 M respectively as required by (the respective) relations (B.10) or (B.8).
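
The collision counts quoted above can be checked with a short numerical sketch. It assumes, as the argument does, that any two output-prototypes collide independently with probability M^(-R) when each ball covers a fraction M^(-R) of the output space. The function names and the sample values of M are illustrative assumptions and do not appear in the thesis.

```python
def expected_colliding_pairs(M: int, R: float) -> float:
    """Expected number of colliding prototype pairs.

    Assumption: each ball covers a fraction M**-R of the output space and
    every pair of prototypes collides independently with that probability.
    """
    pairs = M * (M - 1) / 2          # exact count of unordered pairs
    return pairs * M ** (-R)


def expected_collisions_per_prototype(M: int, R: float) -> float:
    """Expected number of other prototypes falling in one prototype's ball."""
    return (M - 1) * M ** (-R)


if __name__ == "__main__":
    for M in (100, 10_000, 1_000_000):
        print(
            f"M={M:>9}: "
            f"R=1 per-prototype={expected_collisions_per_prototype(M, 1.0):.3f}, "
            f"R=3/2 pairs={expected_colliding_pairs(M, 1.5):.1f}, "
            f"R=2 pairs={expected_colliding_pairs(M, 2.0):.3f}"
        )
```

Under these assumptions, R = 1 gives roughly one collision per prototype, R = 2 gives fewer than one colliding pair in total, and R = 3/2 gives about sqrt(M)/2 colliding pairs, so the fraction of prototypes involved shrinks like 1/sqrt(M) as M grows.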




References 


1. Abu-Mostafa, Yaser S. Connectivity Versus Entropy. IEEE Conference on Neural Information
Processing Systems - Natural and Synthetic, IEEE, November, 1987.

2. Abu-Mostafa, Yaser S. and St. Jacques, Jeannine-Marie. 'Information Capacity of the Hopfield
Model'. IEEE Transactions on Information Theory IT-31, No. 4 (July 1985), 461-464.

3. Amit, Daniel J., Gutfreund, Hanoch, Sompolinsky, H. 'Spin-Glass Models of Neural Networks'.
Physical Review A 32, 2 (August 1985), 1007-1018.

4. Amit, Daniel J., Gutfreund, Hanoch, Sompolinsky, H. 'Storing Infinite Numbers of Patterns in a
Spin-Glass Model of Neural Networks'. Physical Review Letters 55, 14 (September 1985), 1530-1533.

5. Anderson, James A., Silverstein, Jack W., Ritz, Stephen A., and Jones, Randall S. 'Distinctive
Features, Categorical Perception, and Probability Learning: Some Applications of a Neural Model'.
Psychological Review 84, 5 (1977), 413-451.

6. Anderson, James A. 'Cognitive and Psychological Computation with Neural Models'. IEEE
Transactions on Systems, Man, and Cybernetics SMC-13, 5 (September/October 1983).

7. Anderson, James A., Golden, Richard M., Murphy, Gregory L. S.P.I.E. Institute on Hybrid and
Optical Computing, Ed. H. Szu. Volume [illegible]: Concepts in Distributed Systems. S.P.I.E., Bellingham, WA,
1986.

8. Ash, Robert B. Interscience Tracts in Pure and Applied Mathematics. Volume 19: Information
Theory. John Wiley and Sons, New York, New York, 1965.

9. Barto, A. G. 'Learning by Statistical Cooperation of Self-Interested Computing Elements'. Human
Neurobiology 4 (1985), 229-256.

10. Conte, Samuel D., and de Boor, Carl. International Series in Pure and Applied Mathematics.
Elementary Numerical Analysis, An Algorithmic Approach, 3rd Ed. McGraw-Hill Books, 1980.

11. Feller, William. An Introduction to Probability Theory and its Applications, 3rd Ed. John Wiley
and Sons, New York, New York, 1968.

12. Gallager, Robert G. Information Theory and Reliable Communication. John Wiley and Sons, New
York, New York, 1968.

13. Golden, Richard M. Modelling Causal Schemata in Human Memory: A Connectionist Approach.
Ph.D. Th., Dept. of Psychology, Brown University, Providence, Rhode Island, 1986.

14. Golden, Richard M. 'The 'Brain-State-in-a-Box' Neural Model is a Gradient Descent Algorithm'.
Journal of Mathematical Psychology 30, 1 (March 1986), 73-80.

15. Golden, Richard M. 'A Unified Framework for Connectionist Systems'. Biological Cybernetics
(January 1988). Recently submitted for publication; bibliographic information on this article is
incomplete.




16. Greene, Peter H. 'Superimposed Random Coding of Stimulus Response Connections'. Bulletin of
Mathematical Biophysics 27 (Special Issue 1965), 191-201.

17. Gross, D. J., Mezard, M. 'The Simplest Spin Glass'. Nuclear Physics B240, FS12 (1984), 431-452.

18. Grossberg, Stephen. Boston Studies in the Philosophy of Science. Volume 70: Studies of Mind and
Brain: Neural Principles of Learning, Perception, Development, Cognition and Motor Control.
D. Reidel Publishers, Boston, Mass., 1982.

19. Grossberg, Stephen. Competitive Learning: From Interactive Activation to Adaptive Resonance.
Article obtained in personal communication; bibliographic information not complete.

20. Harris, Dale A. Information Theory in Neurophysiology. Dept. of Physiology, Harvard Medical
School. Bibliographical information incomplete.

21. Hinton, Geoffrey E., and Anderson, James A. Parallel Models of Associative Memory. Lawrence
Erlbaum Associates, 365 Broadway, Hillsdale, New Jersey 07642, 1981.

22. Hinton, Geoffrey E. 'Boltzmann Machines: Constraint Satisfaction Networks that Learn'.
Cognitive Science 9 (1986), 147-169.

23. Hopfield, J. J. 'Neural Networks and Physical Systems with Emergent Collective Computational
Abilities'. National Academy of Sciences, U.S.A., Biophysics 79 (April 1982), 2554-2558.

24. Hopfield, J. J. 'Neurons with Graded Response have Collective Computational Properties like those
of Two-state Neurons'. Proceedings of National Academy of Science, U.S.A., Biophysics 81 (May
1984), 3088-3092.

25. Hopfield, J. J. and Tank, D. W. ''Neural' Computation of Decisions in Optimization Problems'.
Biological Cybernetics XX (1985).

26. Kanerva, Pentti. Self-Propagating Search: A Unified Theory of Memory. Ph.D. Th., Stanford
University, 1983.

27. Keeler, James D. Capacity for Patterns and Sequences in Kanerva's SDM as Compared to Other
Associative Memory Models. Tech. Rept. 87.29, Research Institute for Advanced Computer Science,
NASA Ames Research Center, December, 1987.

28. Kohonen, Teuvo. Springer Series in Information Sciences. Volume 8: Self-Organization and
Associative Memory. Springer-Verlag, New York, New York, 1984.

29. Lansner, Anders, and Ekeberg, Örjan. 'Reliability and Speed of Recall in an Associative Network'.
IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-7, No. 4 (July 1985), 490-498.

30. Lindgren, Bernard W. Statistical Theory, 3rd Ed. MacMillan Publishing, New York, New York,
1976.

31. Little, W. A. 'The Existence of Persistent States in the Brain'. Mathematical Biosciences 19
(1974), 101-120.

32. Little, W. A., Shaw, Gordon L. 'Analytic Study of the Memory Storage Capacity of a Neural
Network'. Mathematical Biosciences 39 (1978), 281-290.

33. McEliece, Robert J. Encyclopedia of Mathematics and its Applications. Volume 3: The Theory of
Information and Coding. Addison-Wesley, Reading, Mass., 1977.

34. McEliece, Robert J., Posner, Edward, Rodemich, Eugene R., Venkatesh, Santosh S. The Capacity
of the Hopfield Associative Memory. California Institute of Technology, January 1986. Submitted to
IEEE Transactions on Information Theory.




35. Minsky, Marvin and Papert, Seymour. Perceptrons, An Introduction to Computational Geometry.
M.I.T. Press, Cambridge, Mass., 1969.

36. Parker, David B. Learning Logic. Tech. Rept. TR-47, Center for Computational Research in
Economics and Management Science, M.I.T., April, 1985.

37. Pearlmutter, Barak A. and Hinton, Geoffrey E. G-Maximization: An Unsupervised Learning
Procedure for Discovering Regularities. Neural Networks for Computing, American Institute of Physics,
1986.

38. Rosenblatt, Frank. Principles of Neurodynamics. Spartan Books, New York, New York, 1962.

39. Rudin, William. International Series in Pure and Applied Mathematics. Principles of
Mathematical Analysis, 3rd Ed. McGraw-Hill, New York, N.Y., 1976.

40. Rumelhart, David E., McClelland, James L., and the PDP Research Group. Parallel Distributed
Processing, Explorations in the Microstructure of Cognition. M.I.T. Press, Cambridge, Mass., 1986.

41. Schneider, Walter and Mumme, Dean C. Attention, Automatic Processing and the Compiling of
Knowledge: A Two-Level Architecture for Cognition. To appear in Psychological Review.

42. Sejnowski, Terrence J. and Rosenberg, Charles R. NETtalk: A Parallel Network that Learns to
Read Aloud. Tech. Rept. JHU/EECS-86/01, The Johns Hopkins University Electrical Engineering and
Computer Science, 1986.

43. Shaw, Gordon L., and Roney, Kathleen J. 'Analytic Solution of a Neural Network Theory Based on
an Ising Spin System Analogy'. Physics Letters 74A, 1,2 (October 1979), 146-150.

44. Shiffrin, Richard M. and Schneider, Walter. 'Controlled and Automatic Human Information
Processing: II. Perceptual Learning, Automatic Attending, and a General Theory'. Psychological Review
84, 2 (1977), 127-189.

45. Tanaka, F., Edwards, S. F. 'Analytic Theory of the Ground State Properties of a Spin Glass:
I. Ising Spin Glass'. J. Physics F: Metal Physics 10 (1980), 2769-2778.

46. Tanaka, F., Edwards, S. F. 'Analytic Theory of the Ground State Properties of a Spin Glass: II. XY
Spin Glass'. J. Physics F: Metal Physics 10 (1980), 2779-2792.

47. Viterbi, A. J. 'On Coded Phase-Coherent Communications'. IRE Transactions on Space
Electronics and Telemetry (March 1961), 3-14.


Using  Rules  and  Task  Division  to  Augment  Connectionist  Learning 


William  L.  Oliver 
and 

Walter  Schneider 


University  of  Pittsburgh 
Learning  Research  and  Development  Center 
3939  O'Hara  St. 

Pittsburgh,  PA  15260 
(412)  624-7496 


This  paper  has  been  submitted  for  publication  in  the  Proceedings  of  the  Tenth  Annual 
Conference  of  the  Cognitive  Science  Society. 




Dedication 


This work is dedicated to my parents who
put up with my twelve-year college habit.




STORAGE  CAPACITY  OF  THE  LINEAR 
ASSOCIATOR:  BEGINNINGS  OF  A  THEORY 
OF  COMPUTATIONAL  MEMORY 


BY 

DEAN  C.  MUMME 

B.S., Massachusetts Institute of Technology, 1979
M.S., Idaho State University, 1982


THESIS 


Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1988


Urbana, Illinois




Vita

Dean C. Mumme received his Bachelor of Science in Aeronautics and Astronautics in 1979 from the
Massachusetts Institute of Technology. He then studied Mathematics for four years at Idaho State
University, earning a Master of Science Degree in December, 1982. After studying graduate-level
mathematics for an additional year, he began working for his Ph.D. at the University of Illinois at
Urbana-Champaign in August 1983. During the summer of 1985, he moved with his thesis-advisor to
Pittsburgh, Pennsylvania to complete his thesis research at the Learning Research and Development
Center, University of Pittsburgh. He joined the University of Idaho as Assistant Professor in August 1987
where he is currently teaching and conducting research in connectionist systems.


